Abstract

Motivation. Protein design aims to identify sequences compatible with a given protein fold but incompatible
with any alternative fold. To select the correct sequences and to guide the search process, a design scoring
function is critically important. Such a scoring function should be able to characterize the global fitness
landscape of many proteins simultaneously.

Results. To find optimal design scoring functions, we introduce two geometric views and propose a formulation
using a mixture of nonlinear Gaussian kernel functions. We aim to solve a simplified protein sequence
design problem. Our goal is to distinguish each native sequence for a major portion of representative protein
structures from a large number of alternative decoy sequences, each a fragment from proteins of a different
fold. Our scoring function discriminates perfectly a set of 440 native proteins from 14 million sequence decoys.
We show that no linear scoring function can succeed in this task. In a blind test of unrelated proteins, our
scoring function misclassifies only 13 native proteins out of 194. This compares favorably with about 3 to 4
times more misclassifications when optimal linear functions reported in the literature are used. We also discuss
how to develop a protein folding scoring function.

Key words: Protein scoring function; fitness landscape; nonlinear scoring function; kernel models; 
protein design; protein folding; optimization. 



1 Introduction 



The problem of protein sequence design aims to identify sequences compatible with a given protein fold
and incompatible with alternative folds (Drexler, 1981; Pabo, 1983; DeGrado et al., 1999). It is also called
the inverse protein folding problem. This is a fundamental problem and has attracted considerable interest
(Yue and Dill, 1992; Shakhnovich, 1998; Li et al., 1996; Deutsch and Kurosky, 1996; Koehl and Levitt,
1999a,b). The ultimate goal of protein design is to engineer protein molecules with improved activities
or with newly acquired functions. There have been many important design studies, including the design
of a novel hydrophobic core (Desjarlais and Handel, 1995; G. A. et al., 1997), the design and experimental
validation of an entire protein for a specified backbone (Dahiyat and Mayo, 1997), the design of a novel
alpha helical protein (Emberly et al., 2002), the design and validation of a protein adopting a completely
new fold unseen in nature (Kuhlman et al., 2003), and a soluble analog of membrane potassium channel
(Slovic et al., 2004).

A successful protein design strategy needs to solve two problems. First, it needs to explore both the
sequence and structure search space and efficiently generate candidate sequences. Second, a scoring
function or fitness function needs to identify sequences that are compatible with the desired template
fold (the "design in" principle) but are incompatible with any other competing folds (the "design out"
principle) (Yue and Dill, 1992; Koehl and Levitt, 1999a,b). To achieve this, an ideal scoring function
would maximize the probability that protein sequences take their native fold, and reduce the probability
that these sequences take any other fold. Because many protein sequences with low sequence identity
can adopt the same protein fold, a full-fledged design scoring function should identify all sequences that
fold into the same desired structural fold from a vast number of sequences that fold into alternative
structures, or that do not fold.

Several design scoring functions have been developed based on physical models. For redesigning
protein cores, hydrophobicity and packing specificity are the main ingredients of the scoring functions
(Desjarlais and Handel, 1995). Van der Waals interactions and electrostatics have also been incorporated
for protein design (Koehl and Levitt, 1999a,b). A combination of terms including Lennard-Jones
potential, repulsion, Lazaridis-Karplus implicit solvation, approximated electrostatic interactions, and
hydrogen bonds is used in an insightful computational protein design experiment (Kuhlman and Baker,
2000). Models of solvation energy based on surface area are a key component of several other design
scoring functions (Wernisch et al., 2000; Koehl and Levitt, 1999a,b).

A variety of empirical scoring functions based on known protein structures have also been developed
for coarse-grained models of proteins. In this case, proteins are not represented in atomic detail but are
represented at the residue level. Because of the coarse-grained nature of the protein representation, these
scoring functions allow rapid exploration of the search space of the main factors important for proteins,
and can provide good initial solutions for further refinement where models with atomistic details can be
used.

Many empirical scoring functions were originally developed for the purposes of protein folding and
structure prediction. Because the principles are very similar, they are often used directly for protein
design. One prominent class of empirical scoring functions is knowledge-based scoring functions,
which are derived from statistical analysis of databases of protein structures (Tanaka and Scheraga,
1976; Miyazawa and Jernigan, 1985; Samudrala and Moult, 1998; Lu and Skolnick, 2001). Here the
interaction between a pair of residues is estimated from its relative frequency in the database when
compared with a reference state or a null model. This approach has found many successful applications
(Miyazawa and Jernigan, [...], 1993; Sippl, 1995; Lemer et al., 1995; Jernigan and Bahar, 1996;
Simons et al., 1999; Li et al., 2003). However, there are several conceptual difficulties with this
approach. These include the neglect of chain connectivity in the reference state, and the problematic
implicit assumption of Boltzmann distribution (Thomas and Dill, 1996b; Ben-Naim, 1997).

An alternative approach for empirical scoring functions is to find a set of parameters such that the
scoring functions are optimized by some criterion, e.g., maximized score difference between the native
conformation and a set of alternative (or decoy) conformations (Goldstein et al., 1992; Maiorov and
Crippen, 1992; Thomas and Dill, 1996a; Tobi et al., 2000; Vendruscolo and Domany, 1998; Vendruscolo
et al., 2000a; Bastolla et al., 2001; Dima et al., 2000; Micheletti et al., 2000). This approach has been
shown to be effective in fold recognition, where native structures can be identified from alternative
conformations (Micheletti et al., 2000). However, if a large number of native protein structures are to
be simultaneously discriminated against a large number of decoy conformations, no such scoring functions
can be found (Vendruscolo et al., 2000a; Tobi et al., 2000). A similar conclusion is reached in the present
study for protein design, where we find that no linear design scoring function can simultaneously
discriminate a large number of native proteins from sequence decoys. A recent criticism is that it is
impossible to predict stability changes due to mutation using contact-based scoring functions (Khatun
et al., 2004).

There are three key steps in developing an effective empirical scoring function using optimization: (1) the
functional form, (2) the generation of a large set of decoys for discrimination, and (3) the optimization
technique. The initial step of choosing an appropriate functional form is often straightforward. Empirical
pairwise scoring functions usually all take the form of a weighted linear sum of interacting residue pairs
(see Fain et al. (2002) for an exception). In this functional form, the weight coefficients are the
parameters of the scoring function, which are optimized for discrimination. The same functional form
is also used in statistical potentials, where the weight coefficients are derived from database statistics.
The optimization techniques that have been used include perceptron learning and linear programming
(Tobi et al., 2000; Vendruscolo et al., 2000a). The objectives of optimization are often maximization of
the score gap between the native protein and the average of decoys, or the score gap between the native
protein and the decoys with the lowest score, or the z-score of the native protein (Goldstein et al., 1992;
Koretke et al., 1996, 1998; Hao and Scheraga, 1996; Mirny and Shakhnovich, 1996).

In this work, we study a simplified version of the protein design problem. Our goal is to develop a globally
applicable scoring function for characterizing the fitness landscape of many proteins simultaneously.
Specifically, we aim to identify a protein sequence that is compatible with a given three-dimensional
coarse-grained structure from a set of protein sequences that are taken from protein structures of different
folds. In Conclusion, we discuss how to proceed to develop a full-fledged fitness function that
discriminates similar and dissimilar sequences adopting the same fold against all sequences that adopt
different folds and sequences that do not fold (e.g., all hydrophobes). In this study, we do not address
the problem of how to generate candidate template folds or candidate sequences by searching either the
conformation space or the sequence space.

To develop an empirical scoring function that improves discrimination of native protein sequences, we
explore in this study an alternative formulation of the protein scoring function, in the form of a mixture of
nonlinear Gaussian kernel functions. We also use a different optimization technique based on quadratic
programming. Instead of maximizing the score gap, here an objective function related to bounds of
expected classification errors is optimized (Vapnik and Chervonenkis, 1974; Vapnik, 1995; Burges, 1998;
Schölkopf and Smola, 2002).

Experimentation with the nonlinear function developed in this study shows that it can simultaneously
discriminate 440 native proteins against 14 million sequence decoys. In contrast, we cannot obtain
a perfect weighted linear sum scoring function using the state-of-the-art interior point solver of linear
programming following (Tobi et al., 2000; Meller et al., 2002). We also perform blind tests for native
sequence recognition. Taking 194 proteins unrelated to the 440 training set proteins, the nonlinear
scoring function achieves a success rate of 93.3% in sequence design. This result compares favorably
with optimal linear scoring functions (80.9% and 73.7% success rates) and a statistical potential (58.2%)
(Tobi et al., 2000; Bastolla et al., 2001; Miyazawa and Jernigan, 1996).

The rest of the paper is organized as follows. We first describe the theory and models of linear and
nonlinear functions, including the kernel model and the optimization technique. We then explain details
of computation. We further describe experimental results of learning and results of the blind test. We
conclude with a discussion of how these ideas may be applicable for developing protein folding scoring
functions.

2 Theory and Models 

Modeling Protein Design Scoring Function. To model a protein computationally, we first need
a method to describe its geometric shape and its sequence of amino acid residues. Frequently, a protein
is represented by a $d$-dimensional vector $c \in \mathbb{R}^d$. For example, a widely used method is to count
nonbonded contacts of various types of amino acid residue pairs in a protein structure. In this case,
the count vector $c \in \mathbb{R}^d$, $d = 210$, is used as the protein descriptor. Once the structural conformation
$s$ of a protein and its amino acid sequence $a$ are given, the protein description $f : (s, a) \mapsto \mathbb{R}^d$ will
fully determine the $d$-dimensional vector $c$. In the case of the contact vector, $f$ corresponds to the mapping
provided by a specific contact definition, e.g., two residues are in contact if their distance is below a specific
cut-off threshold distance.
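
As an illustration of the mapping from a structure and a sequence to a 210-dimensional count vector, the following minimal sketch counts contacts with a simple distance cut-off. The 20 residue types give 20 x 21 / 2 = 210 unordered pair types. The function name `contact_vector`, the one-point-per-residue input, the 6.5 cut-off, and the `min_separation` parameter used to skip chain neighbors are all assumptions made for this example; they are not the contact definition used in this study.

```python
from itertools import combinations_with_replacement
from math import dist

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # 20 standard residue types, alphabetical
# Map each unordered residue pair to one of the 20*21/2 = 210 vector slots.
PAIR_INDEX = {pair: k for k, pair in
              enumerate(combinations_with_replacement(AMINO_ACIDS, 2))}

def contact_vector(coords, sequence, cutoff=6.5, min_separation=2):
    """Count contacts per residue-pair type for a sequence mounted on a structure.

    coords: one (x, y, z) point per residue (hypothetical coarse-grained input).
    cutoff and min_separation are illustrative values, not taken from the paper.
    """
    if len(coords) != len(sequence):
        raise ValueError("structure and sequence must have the same length")
    c = [0] * len(PAIR_INDEX)
    for i in range(len(coords)):
        for j in range(i + min_separation, len(coords)):
            if dist(coords[i], coords[j]) < cutoff:
                pair = tuple(sorted((sequence[i], sequence[j])))
                c[PAIR_INDEX[pair]] += 1
    return c
```

Mounting a different sequence on the same `coords` changes only which slots are incremented, which is how a sequence decoy obtains its own count vector on the native structure.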

To develop scoring functions for our simplified problem, namely, a scoring function that allows
the search and identification of sequences most compatible with a specific given coarse-grained
three-dimensional structure, we use a model analogous to the Anfinsen experiments in protein folding. We
require that the native amino acid sequence $a_N$ mounted on the native structure $s_N$ has the best (lowest)
fitness score compared to a set of alternative sequences (sequence decoys) $\mathcal{D} = \{(s_N, a_D)\}$, taken from
unrelated proteins known to fold into a different fold, when mounted on the same native protein structure
$s_N$:

$$H(f(s_N, a_N)) < H(f(s_N, a_D)) \quad \text{for all } (s_N, a_D) \in \mathcal{D}.$$

Equivalently, the native sequence will have the highest probability to fit into the specified native
structure. This is the same principle described in (Shakhnovich and Gutin, 1993; Deutsch and Kurosky, 1996;
Li et al., 1996). Sometimes we can further require that the score difference must be greater than a
constant $b > 0$:

$$H(f(s_N, a_N)) + b < H(f(s_N, a_D)) \quad \text{for all } (s_N, a_D) \in \mathcal{D}.$$

A widely used functional form for the protein scoring function $H$ is the weighted linear sum of pairwise
contacts (Tanaka and Scheraga, 1976; [...] et al., 2000; Vendruscolo and Domany, 1998; Samudrala and
Moult, 1998; Lu and Skolnick, 2001). The linear sum score $H$ is:

$$H(f(s, a)) = H(c) = w \cdot c, \tag{1}$$

where "$\cdot$" denotes the inner product of vectors. As soon as the weight vector $w$ is specified, the scoring
function is fully defined. Much work has been done using this class of design function of linear sum
of contact pairs (Shakhnovich and Gutin, 1993; Deutsch and Kurosky, 1996). For such linear scoring
functions, the basic requirement for a design scoring function is then:

$$w \cdot (c_N - c_D) < 0,$$

or

$$w \cdot (c_N - c_D) + b < 0, \tag{2}$$

if we require that the score difference between a native protein and a decoy must be greater than a real
value $b$. The goal here is to obtain a scoring function to discriminate native proteins from decoys. An
ideal scoring function therefore would assign the value $-1$ for native structure/sequence, and the value
$+1$ for decoys.
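
The linear score of Equation (1) and the requirement of Equation (2) can be checked directly for any candidate weight vector. The sketch below is only illustrative: the helper names `linear_score` and `violated_pairs` and the three-dimensional toy vectors in the usage note are hypothetical, chosen to keep the example small.

```python
def linear_score(w, c):
    """Weighted linear sum H(c) = w . c of Equation (1)."""
    return sum(wi * ci for wi, ci in zip(w, c))

def violated_pairs(w, natives_and_decoys, b=0.0):
    """Return the (c_N, c_D) pairs for which w . (c_N - c_D) + b < 0 fails."""
    bad = []
    for c_native, c_decoy in natives_and_decoys:
        diff = [n - d for n, d in zip(c_native, c_decoy)]
        if not linear_score(w, diff) + b < 0:
            bad.append((c_native, c_decoy))
    return bad
```

For example, with `w = [-1.0, 0.5, 0.0]` the pair `([2, 0, 1], [0, 1, 1])` satisfies the inequality for `b = 1`, while the pair `([0, 2, 0], [1, 0, 0])` is reported as violated.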

Two Geometric Views of Linear Protein Folding Potentials. There is a natural geometric
view of the inequality requirement for weighted linear sum scoring functions. A useful observation is that
each of the inequalities divides the space $\mathbb{R}^d$ into two halves separated by a hyperplane (Fig. 1a). The
hyperplane for Equation (2) is defined by the normal vector $(c_N - c_D)$ and its distance $b/\|c_N - c_D\|$
from the origin. The weight vector $w$ must be located in the half-space opposite to the direction of the
normal vector $(c_N - c_D)$. This half-space can be written as $w \cdot (c_N - c_D) + b < 0$. When there are many
inequalities to be satisfied simultaneously, the intersection of the half-spaces forms a convex polyhedron
(Edelsbrunner, 1987). If the weight vector is located in the polyhedron, all the inequalities are satisfied.
Scoring functions with such a weight vector $w$ can discriminate the native protein sequence from the set
of all decoys. This is illustrated in Fig. 1a for a two-dimensional toy example, where each straight line
represents an inequality $w \cdot (c_N - c_D) + b < 0$ that the scoring function must satisfy.

For each native protein $i$, there is one convex polyhedron $\mathcal{V}_i$ formed by the set of inequalities associated
with its decoys. If a scoring function can discriminate simultaneously $n$ native proteins from a union of
sets of sequence decoys, the weight vector $w$ must be located in a smaller convex polyhedron $\mathcal{V}$ that is
the intersection of the $n$ convex polyhedra:

$$w \in \mathcal{V} = \bigcap_{i=1}^{n} \mathcal{V}_i.$$

There is yet another geometric view of the same inequality requirements. If we now regard $(c_N - c_D)$
as a point in $\mathbb{R}^d$, the relationship $w \cdot (c_N - c_D) + b < 0$ for all sequence decoys and native proteins
requires that all points $\{c_N - c_D\}$ are located on one side of a different hyperplane, which is defined by
its normal vector $w$ and its distance $b/\|w\|$ to the origin (Fig. 1b). We can show that such a hyperplane
exists if the origin is not contained within the convex hull of the set of points $\{c_N - c_D\}$ (see Appendix).

The second geometric view looks very different from the first view. However, the second view is dual
and mathematically equivalent to the first geometric view. In the first view, a point $c_N - c_D$ determined
by the structure-decoy pair $c_N = f(s_N, a_N)$ and $c_D = f(s_N, a_D)$ corresponds to a hyperplane representing
an inequality, and a solution weight vector $w$ corresponds to a point located in the final convex polyhedron.
In the second view, each structure-decoy pair is represented as a point $c_N - c_D$ in $\mathbb{R}^d$, and the solution
weight vector $w$ is represented by a hyperplane separating all the points $C = \{c_N - c_D\}$ from the origin.

Figure 1: Geometric views of the inequality requirement for protein scoring function. Here we use a two-dimensional toy
example for illustration. (a) In the first geometric view, the space $\mathbb{R}^2$ of $w = (w_1, w_2)$ is divided into two half-spaces
by an inequality requirement, represented as a hyperplane $w \cdot (c_N - c_D) + b < 0$. The hyperplane, which is a line in $\mathbb{R}^2$,
is defined by the normal vector $(c_N - c_D)$ and its distance $b/\|c_N - c_D\|$ from the origin. In this figure, this distance is
set to 1.0. The normal vector is represented by a short line segment whose direction points away from the straight line.
A feasible weight vector $w$ is located in the half-space opposite to the direction of the normal vector $(c_N - c_D)$. With
the given set of inequalities represented by the lines, any weight vector $w$ located in the shaded polygon satisfies all
inequality requirements and provides a linear scoring function that has perfect discrimination. (b) A second geometric view
of the inequality requirement for linear protein scoring function. The space $\mathbb{R}^2$ of $x = (x_1, x_2)$, where $x = (c_N - c_D)$, is
divided into two half-spaces by the hyperplane $w \cdot (c_N - c_D) + b < 0$. Here the hyperplane is defined by the normal vector
$w$ and its distance $b/\|w\|$ from the origin. All points $\{c_N - c_D\}$ are located on one side of the hyperplane away from
the origin, therefore satisfying the inequality requirement. That is, a linear scoring function $w$ such as the one represented
by the straight line in this figure can have perfect discrimination. (c) In the second toy problem, a set of inequalities is
represented by a set of straight lines according to the first geometric view. A subset of the inequalities requires that the
weight vector $w$ be located in the shaded convex polygon on the left, but another subset of inequalities requires that
$w$ be located in the dashed convex polygon on the top. Since these two polygons do not intersect, there is no weight
vector $w$ that can satisfy all inequality requirements. That is, no linear scoring function can classify these decoys from
native protein. (d) According to the second geometric view, no hyperplane can separate all points $\{c_N - c_D\}$ from the
origin. But a nonlinear curve formed by a mixture of Gaussian kernels can have perfect separation of all vectors $\{c_N - c_D\}$
from the origin: it has perfect discrimination.

Optimal Linear Scoring Function. Several optimization methods have been applied to find the
weight vector $w$ of the linear scoring function. The Rosenblatt perceptron method works by iteratively
updating an initial weight vector $w_0$ (Vendruscolo and Domany, 1998; Micheletti et al., 2000). Starting
with an initial vector, e.g., $w_0 = 0$, one tests each native protein and its decoy structure. Whenever
the relationship $w \cdot (c_N - c_D) + b < 0$ is violated, one updates $w$ by adding to it a scaled violating
vector $\eta \cdot (c_N - c_D)$. The final weight vector is therefore a linear combination of protein and decoy count
vectors:

$$w = \sum \eta\,(c_N - c_D) = \sum_{N \in \mathcal{N}} \alpha_N c_N - \sum_{D \in \mathcal{D}} \alpha_D c_D. \tag{3}$$

Here $\mathcal{N}$ is the set of native proteins, and $\mathcal{D}$ is the set of decoys. The set of coefficients $\{\alpha_N\} \cup \{\alpha_D\}$
gives a dual form representation of the weight vector $w$, which is an expansion of the training examples
including both native and decoy structures.
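
A minimal sketch of the perceptron update follows, written in terms of the difference vectors $x = c_N - c_D$. It is not the implementation of the cited works. The function name `perceptron`, the step size, and the epoch cap are assumptions. The sketch subtracts a positive multiple of the violating vector so that $w \cdot x$ decreases, which corresponds to a negative scaling factor in the notation of the text.

```python
def perceptron(diffs, b=1.0, eta=0.1, max_epochs=1000):
    """Search for w with w . x + b < 0 for every x = c_N - c_D in diffs.

    Returns (w, converged). Step size eta and the epoch cap are illustrative.
    """
    w = [0.0] * len(diffs[0])          # start from w_0 = 0
    for _ in range(max_epochs):
        updated = False
        for x in diffs:
            if sum(wi * xi for wi, xi in zip(w, x)) + b >= 0:   # violated
                # Move w against the violating vector so that w . x decreases.
                w = [wi - eta * xi for wi, xi in zip(w, x)]
                updated = True
        if not updated:
            return w, True
    return w, False
```

When the origin lies inside the convex hull of the difference vectors, for instance for the two points `(1.0, 0.0)` and `(-1.0, 0.0)`, no pass ever completes without an update and the function returns with `converged` set to `False`.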

According to the first geometric view, if the final convex polyhedron $\mathcal{V}$ is non-empty, there can be an
infinite number of choices of $w$, all with perfect discrimination. But how do we find a weight vector $w$ that is
optimal? This depends on the criterion for optimality. For example, one can choose the weight vector $w$
that minimizes the variance of score gaps between decoys and natives:

$$\arg\min_{w} \; \frac{1}{|\mathcal{D}|} \sum \left( w \cdot (c_N - c_D) \right)^2 - \left[ \frac{1}{|\mathcal{D}|} \sum \left( w \cdot (c_N - c_D) \right) \right]^2$$

as used in reference (Tobi et al., 2000), or minimizing the Z-score of a
large set of native proteins, or minimizing the Z-score of the native protein and an ensemble of decoys
(Chiu and Goldstein, 1998; Mirny and Shakhnovich, 1996), or maximizing the ratio R between the
width of the distribution of the score and the average score difference between the native state and
the unfolded ones (Goldstein et al., 1992; Hao and Scheraga, 1999). A series of important works using
perceptron learning and other optimization techniques (Friedrichs and Wolynes, 1989; Goldstein et al.,
1992; Tobi et al., 2000; Vendruscolo and Domany, 1998; Dima et al., 2000) showed that effective linear
sum scoring functions can be obtained.

Here we describe yet another optimality criterion according to the second geometric view. We can
choose the hyperplane $(w, b)$ that separates the points $\{c_N - c_D\}$ with the largest distance to the origin.
Intuitively, we want to characterize proteins with a region defined by the training set points $\{c_N - c_D\}$. It
is desirable to define this region such that a new unseen point drawn from the same protein distribution
as $\{c_N - c_D\}$ will have a high probability to fall within the defined region. Non-protein points following a
different distribution, which is assumed to be centered around the origin when no a priori information is
available, will have a high probability to fall outside the defined region. In this case, we are more interested
in modeling the region or support of the distribution of protein data, rather than estimating its density
distribution function. For linear scoring functions, regions are half-spaces defined by hyperplanes, and
the optimal hyperplane $(w, b)$ is then the one with maximal distance to the origin. This is related to the
novelty detection problem and single-class support vector machine studied in statistical learning theory
(Vapnik and Chervonenkis, 1964, 1974; Schölkopf and Smola, 2002). In our case, any non-protein points
will need to be detected as outliers from the protein distribution characterized by $\{c_N - c_D\}$. Among
all linear functions derived from the same set of native proteins and decoys, an optimal weight vector $w$
is likely to have the least amount of mislabellings. The optimal weight vector $w$ can be found by solving
the following quadratic programming problem:

$$\text{Minimize} \quad \tfrac{1}{2}\|w\|^2 \tag{4}$$

$$\text{subject to} \quad w \cdot (c_N - c_D) + b < 0 \quad \text{for all } N \in \mathcal{N} \text{ and } D \in \mathcal{D}. \tag{5}$$

The solution maximizes the distance $b/\|w\|$ of the plane $(w, b)$ to the origin. We obtained the solution
by solving the following support vector machine problem:

$$\text{Minimize} \quad \tfrac{1}{2}\|w\|^2$$

$$\text{subject to} \quad w \cdot c_N + d < -1, \qquad w \cdot c_D + d > 1, \tag{6}$$

where $d > 0$. Note that a solution of Problem (6) satisfies the constraints in Inequalities (5), since
subtracting the second inequality here from the first inequality in the constraint conditions of (6) will
give us $w \cdot (c_N - c_D) + 2 < 0$.

Nonlinear Scoring Function. However, it is possible that the weight vector $w$ does not exist, i.e.,
the final convex polyhedron $\mathcal{V} = \bigcap_{i=1}^{n} \mathcal{V}_i$ may be an empty set. First, for a specific native protein $i$, there
may be severe restriction from some inequality constraints, which makes $\mathcal{V}_i$ an empty set. Some decoys
are very difficult to discriminate due to perhaps deficiency in protein representation. In these cases,
it is impossible to adjust the weight vector so the native protein has a lower score than the sequence
decoy. Figure 1c shows a set of inequalities represented by straight lines according to the first geometric
view. A subset of inequalities (black lines) requires that the weight vector $w$ be located in the shaded
convex polygon on the left, but another subset of inequalities (green lines) requires that $w$ be located
in the dashed convex polygon on the top. Since these two polygons do not intersect, there is no weight
vector that can satisfy all these inequality requirements. That is, no linear scoring function can classify
all decoys from native protein. According to the second geometric view (Figure 1d), no hyperplane can
separate all points (black and green) $\{c_N - c_D\}$ from the origin.

Second, even if a weight vector $w$ can be found for each native protein, i.e., $w$ is contained in a
nonempty polyhedron, it is still possible that the intersection of $n$ polyhedra is an empty set, i.e., no
weight vector can be found that can discriminate all native proteins against the decoys simultaneously.
Computationally, the question whether a solution weight vector $w$ exists can be answered unambiguously
in polynomial time (Karmarkar, 1984), and results described later in this study show that when the
number of decoys reaches millions, no such weight vector can be found.

A fundamental reason for this failure is that the functional form of linear sum is too simplistic. It
has been suggested that additional descriptors of protein structures such as higher order interactions (e.g.,
three-body or four-body contacts) should be incorporated in protein description (Betancourt and
Thirumalai, 1999; Munson and Singh, 1997; Zheng et al., 1997). Functions with polynomial terms using up to
6 degrees of Chebyshev expansion have also been used to represent pairwise interactions in protein folding
(Fain et al., 2002).

Here we propose an alternative approach. In this study we still limit ourselves to pairwise contact
interactions, although the approach can be naturally extended to include three- or four-body interactions
(Li and Liang, 2004). We introduce a nonlinear scoring function analogous to the dual form of the linear
function in Equation (3), which takes the following form:

$$H(f(s, a)) = H(c) = \sum_{D \in \mathcal{D}} \alpha_D K(c, c_D) - \sum_{N \in \mathcal{N}} \alpha_N K(c, c_N), \tag{7}$$

where $\alpha_D \ge 0$ and $\alpha_N \ge 0$ are parameters of the scoring function to be determined, and $c_D = f(s_N, a_D)$
from the set of decoys $\mathcal{D} = \{(s_N, a_D)\}$ is the contact vector of a sequence decoy $D$ mounted on a native
protein structure $s_N$, and $c_N = f(s_N, a_N)$ from the set of native training proteins $\mathcal{N} = \{(s_N, a_N)\}$ is
the contact vector of a native sequence $a_N$ mounted on its native structure $s_N$. In this study, all decoy
sequences $\{a_D\}$ are taken from real proteins possessing different fold structures. The difference of this
functional form from the linear function in Equation (1) is that a kernel function $K(x, y)$ replaces the linear
term. A convenient kernel function $K$ is:

$$K(x, y) = e^{-\|x - y\|^2 / 2\sigma^2} \quad \text{for any vectors } x \text{ and } y \in \mathcal{N} \cup \mathcal{D},$$

where $\sigma^2$ is a constant. Intuitively, the surface of the scoring function has smooth Gaussian hills of
height $\alpha_D$ centered on the location $c_D$ of decoy protein $D$, and has smooth Gaussian cones of depth $\alpha_N$
centered on the location $c_N$ of native structures $N$. Ideally, the value of the scoring function will be $-1$
for contact vectors $c_N$ of native proteins, and will be $+1$ for contact vectors $c_D$ of decoys.
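
To make the picture of Gaussian hills and wells concrete, the following sketch evaluates Equation (7) for given coefficients. The function names, the representation of each training example as an `(alpha, contact_vector)` tuple, and the default `sigma2` are assumptions made for illustration only.

```python
from math import exp

def gaussian_kernel(x, y, sigma2=1.0):
    """K(x, y) = exp(-||x - y||^2 / (2 * sigma^2))."""
    sq_dist = sum((xi - yi) ** 2 for xi, yi in zip(x, y))
    return exp(-sq_dist / (2.0 * sigma2))

def nonlinear_score(c, decoys, natives, sigma2=1.0):
    """Evaluate Equation (7): decoy hills minus native wells.

    decoys, natives: lists of (alpha, contact_vector) with alpha >= 0.
    """
    hills = sum(a * gaussian_kernel(c, cd, sigma2) for a, cd in decoys)
    wells = sum(a * gaussian_kernel(c, cn, sigma2) for a, cn in natives)
    return hills - wells
```

With a single native at `(0.0, 0.0)` and a single decoy at `(3.0, 0.0)`, both with coefficient 1, the score is negative at the native location and positive at the decoy location, as the intuitive description suggests.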

Figure 2: Decoy generation by gapless threading. Sequence decoys can be generated by threading the sequence
of a larger protein to the structure of an unrelated smaller protein.

Optimal Nonlinear Scoring Function. To obtain the nonlinear scoring function, our goal is
to find a set of parameters $\{\alpha_D, \alpha_N\}$ such that $H(f(s_N, a_N))$ has a value close to $-1$ for native
proteins, and the decoys have values close to $+1$. There are many different choices of $\{\alpha_D, \alpha_N\}$. We use
an optimality criterion originally developed in statistical learning theory (Vapnik, 1995; Burges, 1998;
Schölkopf and Smola, 2002). First, we note that we have implicitly mapped each structure and decoy
from $\mathbb{R}^{210}$ through the kernel function $K(x, y) = e^{-\|x - y\|^2 / 2\sigma^2}$ to another space with dimension as
high as tens of millions. Second, we then find the hyperplane of the largest margin distance separating
proteins and decoys in the space transformed by the nonlinear kernel. That is, we search for a hyperplane
with equal and maximal distance to the closest native proteins and the closest decoys in the transformed
high dimensional space. Such a hyperplane can be found by obtaining the parameters $\{\alpha_D\}$ and $\{\alpha_N\}$
from solving the following Lagrange dual form of the quadratic programming problem:

$$\text{Maximize} \quad \sum_{i \in \mathcal{N} \cup \mathcal{D}} \alpha_i - \frac{1}{2} \sum_{i, j \in \mathcal{N} \cup \mathcal{D}} y_i y_j \alpha_i \alpha_j \, e^{-\|c_i - c_j\|^2 / 2\sigma^2}$$

$$\text{subject to} \quad 0 \le \alpha_i \le C,$$

where $C$ is a regularizing constant that limits the influence of each misclassified protein or decoy
(Vapnik and Chervonenkis, 1964, 1974; Vapnik, 1995; Burges, 1998; Schölkopf and Smola, 2002), and
$y_i = -1$ if $i$ is a native protein, and $y_i = +1$ if $i$ is a decoy. These parameters lead to optimal
discrimination of an unseen test set (Vapnik and Chervonenkis, 1964, 1974; Vapnik, 1995; Burges, 1998;
Schölkopf and Smola, 2002). When projected back to the space of $\mathbb{R}^{210}$, this hyperplane becomes a
nonlinear surface. For the toy problem of Figure 1c, Figure 1d shows that such a hyperplane becomes
a nonlinear curve in $\mathbb{R}^2$ formed by a mixture of Gaussian kernels. It separates perfectly all vectors
$\{c_N - c_D\}$ (black and green) from the origin. That is, a nonlinear scoring function can have perfect
discrimination.
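
For small problems, the box-constrained dual above can be approached with a very simple coordinate-wise projected ascent (in the spirit of the kernel adatron). This is a swapped-in toy solver used only to illustrate the objective, not the quadratic programming method used in this study. The function names, the XOR-like toy data in the usage note, and the values of `sigma2`, `C`, `eta`, and `sweeps` are all assumptions.

```python
from math import exp

def gaussian_kernel(x, y, sigma2):
    return exp(-sum((a - b) ** 2 for a, b in zip(x, y)) / (2.0 * sigma2))

def fit_dual(points, labels, sigma2=0.25, C=10.0, eta=0.5, sweeps=200):
    """Coordinate-wise projected ascent on the box-constrained dual.

    labels: -1 for native, +1 for decoy. Returns the list of alpha_i.
    """
    n = len(points)
    K = [[gaussian_kernel(points[i], points[j], sigma2) for j in range(n)]
         for i in range(n)]
    alpha = [0.0] * n
    for _ in range(sweeps):
        for i in range(n):
            # y_i * H(c_i): the dual gradient for alpha_i is 1 minus this value.
            margin = labels[i] * sum(alpha[j] * labels[j] * K[i][j]
                                     for j in range(n))
            # Gradient step followed by projection onto the box [0, C].
            alpha[i] = min(C, max(0.0, alpha[i] + eta * (1.0 - margin)))
    return alpha

def score(c, points, labels, alpha, sigma2=0.25):
    """H(c) = sum_i alpha_i * y_i * K(c, c_i), matching the form of Equation (7)."""
    return sum(a * y * gaussian_kernel(c, p, sigma2)
               for a, y, p in zip(alpha, labels, points))
```

On the four points `(0, 0)`, `(1, 1)` labeled $-1$ and `(0, 1)`, `(1, 0)` labeled $+1$, which no straight line can separate, the fitted coefficients give scores close to $-1$ and $+1$ at the respective training points.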



3 Computational Methods 

Alpha Contact Maps. Because protein molecules are formed by thousands of atoms, their shapes
are complex. In this study we use the count vector of pairwise contact interactions after normalization
by the chain length of the protein (Edelsbrunner, 1995; Liang et al., 1998). Here contacts are derived
from the edge simplices of the alpha shape of a protein structure (Li et al., 2003). These edge simplices
represent nearest neighbor interactions that are in physical contact. They encode precisely the same
contact information as a subset of the edges in the Voronoi diagram of the protein molecule. These
Voronoi edges are shared by two interacting atoms from different residues, but intersect with the body of
the molecule modeled as the union of atom balls. A statistical potential based on edge simplices has been
developed (Li et al., 2003). We refer to references (Edelsbrunner, 1995; Liang et al., 1998) for further
theoretical and computational details.



Generating Sequence Decoys by Threading. Maiorov and Crippen introduced the gapless
threading method to generate a large number of decoys (Maiorov and Crippen, 1992). The sequence
$a_N$ of a smaller protein is threaded through the structure of an unrelated larger protein and takes the
conformation $s_D$ of a fragment of the same length from the larger protein (Maiorov and Crippen,
1992). Along the way, the sequence of the smaller protein can take the conformations of many fragments
of the larger protein, each of which becomes a structure decoy.

We can generate sequence decoys in an analogous way, as already suggested in (Jones et al., 1992;
Munson and Singh, 1997). We thread the sequence of a larger protein through the structure of a smaller
protein, and obtain sequence decoys by mounting a fragment of the sequence of the large protein onto the
full structure of the small protein. We therefore have for each native protein $(s_N, a_N)$ a set of sequence
decoys $(s_N, a_D)$ (see figure). Because all native contacts are retained in this case, sequence decoys obtained
by gapless threading are far more challenging than structure decoys generated by gapless threading.
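
The sliding-window idea behind gapless threading of sequences can be sketched as follows. This is a minimal illustration, not the authors' code: the function name `sequence_decoys` and both example sequences are hypothetical, and a real run would repeat this over every unrelated larger protein in the data set.

```python
def sequence_decoys(small_len, large_seq):
    """Yield every contiguous fragment of large_seq that has length small_len.

    Each fragment is one sequence decoy: it is mounted, without gaps,
    onto the full structure of the smaller protein.
    """
    for start in range(len(large_seq) - small_len + 1):
        yield large_seq[start:start + small_len]


# Hypothetical toy sequences (one-letter amino acid codes).
native_seq = "ACDEF"        # sequence of the small protein
large_seq = "MKTAYIAKQR"    # sequence of an unrelated larger protein

decoys = list(sequence_decoys(len(native_seq), large_seq))
```

A larger protein of length L threaded onto a structure of length l yields L - l + 1 decoys, which is why the number of decoys grows so quickly with the size of the data set.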

Protein Data. Following reference (Vendruscolo et al., 2000a), we use protein structures contained
in the Whatif database (Vriend and Sander, 1993) in this study. The Whatif database contains a
representative set of sequence-unique protein structures determined by X-ray crystallography. Structures
selected for this study all have pairwise sequence identity < 30%, R-factor < 0.21, and resolution < 2.1 Å.
The Whatif database contains fewer structures than Pdbselect because its R-factor and resolution criteria
are more stringent (Vriend and Sander, 1993). Nevertheless, it provides a good representative set of
all currently known protein structures.

We use a list of 456 proteins kindly provided by Dr. Vendruscolo, which was compiled from the 1998
release (Whatif98) of the Whatif database (Vendruscolo et al., 2000a). There are 192 proteins with
multiple chains in this dataset. Some of them have extensive interchain contacts. For these proteins, the
conformation of a chain may be different in the absence of interchain contacts. We use the
criterion of Contact Ratio to remove proteins that have extensive interchain contacts. Contact Ratio is
defined here as the number of interchain contacts divided by the total number of contacts a chain makes.
For example, protein 1ept has four chains A, B, C, and D. The number of intrachain contacts of chain B
is 397. There are 178 contacts between chain A and chain B, 220 between B and C, and 11 between B and
other heteroatoms. The Contact Ratio of chain B is therefore (178 + 220 + 11)/(397 + 178 + 220 + 11) = 51%.
Thirteen protein chains are removed because their Contact Ratio is > 30%. We further remove
three proteins because each has > 10% of residues with no coordinates in the Protein Data Bank
file. The remaining set of 440 proteins is then used as the training set for developing both folding and
design scoring functions. Using the threading method described earlier, we generated a set of 14,080,766
sequence decoys.
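
The Contact Ratio filter can be checked with a few lines of code. This is a minimal sketch using the chain B numbers quoted above; the function name `contact_ratio` is an illustrative choice, not a name from the original work.

```python
def contact_ratio(intrachain, interchain):
    """Interchain contacts divided by all contacts a chain makes.

    intrachain: number of contacts within the chain
    interchain: iterable of contact counts with other chains or heteroatoms
    """
    total_inter = sum(interchain)
    return total_inter / (intrachain + total_inter)


# Chain B of the example: 397 intrachain contacts, 178 with chain A,
# 220 with chain C, and 11 with other heteroatoms.
ratio_b = contact_ratio(397, [178, 220, 11])
exceeds_threshold = ratio_b > 0.30  # the 30% cutoff used for removal
```

With these counts the ratio is 409/806, which rounds to the 51% stated in the text and exceeds the 30% threshold.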



Learning Linear Scoring Function. For comparison, we have also developed optimal linear scoring
functions following the method and computational procedure described in reference (Tobi et al., 2000).
We apply the interior point method as implemented in the BPMD package by Meszaros (Meszaros, 1996)
to search for a weight vector $w$. We use two different optimization criteria as described in reference
(Tobi et al., 2000). The first is:

Identify $w$
subject to $w \cdot (c_N - c_D) < \epsilon$ and $|w_i| < 10$,

where $w_i$ denotes the $i$-th component of the weight vector $w$, and $\epsilon = 1 \times 10^{-6}$. Let $C = \{c_N - c_D\}$, and $|C|$
the number of decoys. The second optimization criterion is:

Minimize $\frac{1}{|C|} \sum_{C} \left( w \cdot (c_N - c_D) \right)^2 - \left[ \frac{1}{|C|} \sum_{C} w \cdot (c_N - c_D) \right]^2$

subject to $w \cdot (c_N - c_D) < \epsilon$.
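
The feasibility conditions shared by both criteria can be verified for any candidate weight vector. The sketch below only checks the constraints; it does not solve the linear program, which in the text is done with an interior point solver. The helper names and the two-dimensional toy vectors are hypothetical.

```python
def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def satisfies_constraints(w, pairs, eps=1e-6, bound=10.0):
    """Check w . (c_N - c_D) < eps for every (c_N, c_D) pair and |w_i| < bound."""
    if any(abs(wi) >= bound for wi in w):
        return False
    for c_native, c_decoy in pairs:
        diff = [a - b for a, b in zip(c_native, c_decoy)]
        if dot(w, diff) >= eps:
            return False
    return True


# Hypothetical toy contact vectors: one native and one decoy.
pairs = [([1.0, 0.0], [0.0, 1.0])]
ok_good = satisfies_constraints([-1.0, 1.0], pairs)    # native scores lower
ok_bad = satisfies_constraints([1.0, -1.0], pairs)     # native scores higher
ok_large = satisfies_constraints([-11.0, 1.0], pairs)  # violates |w_i| < 10
```

A check of this kind is useful after solving, since a single violated pair among millions of decoys means the returned weight vector does not discriminate perfectly.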



Learning Nonlinear Kernel Scoring Function. We use SVMlight (http://svmlight.joachims.org/)
(Joachims, 1999) with Gaussian kernels and a training set of 440 native proteins plus 14,080,766 decoys
to obtain the optimized parameters $\{\alpha_N, \alpha_D\}$. The regularization constant $C$ takes its default value, which
is estimated from the training set $\mathcal{N} \cup \mathcal{D}$:

$$C = |\mathcal{N} \cup \mathcal{D}|^2 \Big/ \left[ \sum_{x \in \mathcal{N} \cup \mathcal{D}} \sqrt{K(x, x) - 2 K(x, 0) + K(0, 0)} \right]^2 \qquad (8)$$
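
Equation (8) amounts to the inverse square of the average feature-space norm of the training vectors. A minimal sketch of that estimate is given below, assuming a Gaussian kernel with variance parameter `sigma2`; the function names and toy vectors are hypothetical, and this is not the SVMlight source code.

```python
import math


def gaussian_kernel(x, y, sigma2):
    """K(x, y) = exp(-||x - y||^2 / (2 * sigma2))."""
    dist2 = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-dist2 / (2.0 * sigma2))


def default_c(vectors, kernel):
    """Estimate C as (n / sum of feature-space norms)^2, following Equation (8)."""
    origin = [0.0] * len(vectors[0])
    k00 = kernel(origin, origin)
    total = 0.0
    for x in vectors:
        # Squared feature-space distance of x from the image of the origin.
        norm2 = kernel(x, x) - 2.0 * kernel(x, origin) + k00
        total += math.sqrt(max(norm2, 0.0))
    if total == 0.0:
        raise ValueError("all vectors coincide with the origin")
    return (len(vectors) / total) ** 2


# Hypothetical toy contact vectors.
toy_vectors = [[1.0, 0.0], [0.0, 2.0]]
c_gauss = default_c(toy_vectors, lambda x, y: gaussian_kernel(x, y, 416.7))
# With a plain inner product the norms are 1 and 2, so C = (2 / 3)^2.
c_linear = default_c(toy_vectors, lambda x, y: sum(a * b for a, b in zip(x, y)))
```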



Since we cannot load all 14 million decoys into computer memory simultaneously, we use a heuristic
strategy for training. Similar to the procedure reported in (Tobi et al., 2000), we first randomly selected



Table 1: Details of derivation of nonlinear kernel design scoring functions. The numbers of native proteins and
decoys with non-zero $\alpha_i$ entering the scoring function are listed. The ranges of the score values of natives and
decoys are also listed, as well as the range of the smallest gaps between the scores of the native protein and decoys.
Details for the nonlinear kernel folding scoring function are also listed.

| | | Design Scoring Function | Folding Scoring Function |
|---|---|---|---|
| $\sigma^2$ | | 416.7 | 227.3 |
| Num. of Vectors | Natives | 220 | 214 |
| | Decoys | 1685 | 1362 |
| Range of Score Values | Natives | 0.9992 ~ 4.598 | 0.9990 ~ 4.215 |
| | Decoys | -9.714 ~ 0.7423 | -6.859 ~ 0.3351 |
| Range of Smallest Score Gap | | 0.2575 ~ 11.53 | 0.8446 ~ 9.816 |


a subset of decoys that fits into the computer memory. Specifically, we pick every 51st decoy from the
list of 14 million decoys. This leads to an initial training set of 276,095 decoys and 440 native proteins.
An initial protein scoring function is then obtained. Next, the scores for all 14 million decoys and all 440
native proteins are evaluated. Three decoy sets are collected based on the evaluation results: the first
set contains the violating decoys, which have a lower score than the native structures; the second
set contains the decoys with the lowest absolute scores; and the third set contains the decoys that participate
in $H(c)$ as identified in the previous training process. The union of these three subsets of decoys is then
combined with the 440 native proteins as the training set for the next iteration of learning. This process
is repeated until the score difference from the native protein is greater than 0.0 for all decoys. Using this
strategy, the number of iterations is typically between 2 and 10. During the training process, we set
the cost factor j in SVMlight to 120, which is the factor by which training errors on native proteins outweigh
training errors on decoys.
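
The working-set heuristic described above can be sketched as a short loop. This is an illustrative outline under stated assumptions, not the authors' implementation: `train` and `score` are hypothetical callables standing in for the SVM package, lower scores are taken to mean more native-like, `n_closest` is an assumed size for the second decoy set, and the toy stand-ins at the end exist only so the loop can be run.

```python
def iterate_training(natives, decoys, train, score,
                     stride=51, n_closest=1000, max_rounds=10):
    """Retrain on a growing working set until no decoy violates its native.

    natives: dict mapping native id -> contact vector
    decoys:  list of (native id, contact vector) pairs
    train(natives, decoys, working_ids) -> (model, support_ids)
    score(model, vector) -> float, lower meaning more native-like
    """
    working = set(range(0, len(decoys), stride))  # every stride-th decoy
    model = None
    for round_no in range(1, max_rounds + 1):
        model, support_ids = train(natives, decoys, sorted(working))
        native_h = {nid: score(model, vec) for nid, vec in natives.items()}
        decoy_h = [score(model, vec) for _, vec in decoys]
        # Set 1: decoys that score no higher than their own native protein.
        violating = {i for i, (nid, _) in enumerate(decoys)
                     if decoy_h[i] <= native_h[nid]}
        if not violating:
            return model, round_no
        # Set 2: decoys with the smallest absolute score.
        by_abs = sorted(range(len(decoys)), key=lambda i: abs(decoy_h[i]))
        closest = set(by_abs[:n_closest])
        # Set 3: decoys that entered the previous scoring function.
        working = violating | closest | set(support_ids)
    return model, max_rounds


# Toy stand-ins so the loop can be exercised without an SVM package.
def toy_train(natives, decoys, working_ids):
    # The "model" is just the working-set size; no support vectors reported.
    return len(working_ids), []


def toy_score(model, vec):
    # Pretend that a tiny working set yields a wrong-signed score.
    return vec[0] if model > 1 else -vec[0]


toy_natives = {"p": [-1.0]}
toy_decoys = [("p", [float(v)]) for v in range(1, 6)]
final_model, rounds = iterate_training(toy_natives, toy_decoys, toy_train,
                                       toy_score, stride=51, n_closest=2)
```

In the toy run the first round produces violations for every decoy, the working set grows to include them, and the second round converges.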

The value of $\sigma^2$ for the Gaussian kernel $K(x, y) = e^{-\|x - y\|^2 / 2\sigma^2}$ is chosen by experimentation. If the
value of $\sigma^2$ is too large, no parameter set $\{\alpha_N, \alpha_D\}$ can be found such that the fitness scoring function
perfectly classifies the 440 training proteins and their decoys, i.e., the problem is unlearnable. If
the value of $\sigma^2$ is too small, the performance in the blind test deteriorates. The final design scoring
function is obtained with $\sigma^2$ set to 416.7.

4 Results 

Linear Design Scoring Functions. To search for the optimal weight vector $w$ for the design scoring
function, we use a linear programming solver based on the interior point method as implemented in BPMD by
Meszaros (Meszaros, 1996). After generating 14,080,766 sequence design decoys for the 440 proteins in
the training set, we search for an optimal $w$ that can discriminate native sequences from decoy sequences.
That is, we search for parameters $w$ for $H(s, a) = w \cdot c$, such that $w \cdot c_N < w \cdot c_D$ for all sequences.
However, we fail to find a feasible solution for the weight vector $w$. That is, no $w$ exists that is capable
of discriminating perfectly the 440 native sequences from the 14 million decoy sequences. We repeated the
same experiment using a larger set of 572 native proteins from reference (Tobi et al., 2000) and 28,261,307
sequence decoys. The result is also negative.

Nonlinear Kernel Scoring Function. To overcome the problems associated with linear functions,
we use the set of 440 native proteins and 14 million decoys to derive nonlinear kernel design functions.
We succeeded in finding a function in the form of Equation (7) that can discriminate all 440 native
proteins from 14 million decoys.

Unlike statistical scoring functions, where each native protein in the database contributes to the
empirical scoring function, only a subset of native proteins contribute and have $\alpha_N \neq 0$. In addition, a
small fraction of decoys also contribute to the scoring function. Table 1 lists the details of the scoring
function, including the numbers of native proteins and decoys that participate in Equation (7). These
numbers represent about 50% of native proteins and < 0.1% of decoys from the original training data.

Table 2: The number of misclassified protein sequences for the test set of 194 proteins and the set of 201 proteins
using the nonlinear kernel design scoring function, two optimal linear scoring functions taken as reported in
(Tobi et al., 2000) and in Table I of (Bastolla et al., 2001), and the Miyazawa-Jernigan statistical potential
(Miyazawa and Jernigan, 1996). The nonlinear kernel design scoring function has the best performance in the blind
test and is the only function that succeeded in perfect discrimination of the 440 native sequences from a set of
14 million sequence decoys.

| Scoring function | Misclassified Natives (194 set) | Misclassified Natives (201 set) |
|---|---|---|
| Kernel Design Scoring Function | 13/194 | 19/201 |
| Tobi & Elber | 37/194 | 44/201 |
| Bastolla et al | 51/194 | 54/201 |
| Miyazawa & Jernigan | 81/194 | 87/201 |



Discrimination Tests for Design Scoring Function. A blind test in discriminating native proteins
from decoys for an independent test set is essential to assess the effectiveness of design scoring
functions. To construct a test set, we first take the entries in the Whatif99 database that are not present
in Whatif98. After eliminating proteins with chain length less than 46 residues, we obtain a set of
201 proteins. These proteins all have < 30% sequence identity with any other sequence in either the
training set or the test set. Since 139 of the 201 test proteins have multiple chains, we use the
same criteria applied in training set selection to exclude 7 proteins with > 30% Contact Ratio or with
> 10% of residues missing coordinates in the PDB files. This leaves a smaller test set of 194
proteins. Using gapless threading, we generate a set of 3,096,019 sequence decoys from the set of 201
proteins. This is a superset of the decoy set generated using the 194 proteins.

To test design scoring functions for discriminating native proteins from sequence decoys in both the
194 and the 201 test sets, we take the sequence $a$ from the conformation-sequence pair $(s_N, a)$ for
a protein with the lowest score as the predicted sequence. If it is not the native sequence $a_N$, the
discrimination has failed and the design scoring function does not work for this protein.
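
The test criterion above reduces to asking whether the native sequence scores strictly lower than every decoy on the native structure. A minimal sketch follows; `toy_score` is a hypothetical stand-in for a real scoring function, and the "structure" is reduced to a string of preferred residues purely for illustration.

```python
def native_is_recognized(score, structure, native_seq, decoy_seqs):
    """Return True if the native sequence has a strictly lower score than all decoys."""
    native_h = score(structure, native_seq)
    return all(score(structure, decoy) > native_h for decoy in decoy_seqs)


def toy_score(structure, seq):
    # Hypothetical score: number of positions that disagree with the
    # residue preferred at that position of the toy "structure".
    return sum(1 for want, got in zip(structure, seq) if want != got)


toy_structure = "ACDE"
recognized = native_is_recognized(toy_score, toy_structure, "ACDE", ["ACDF", "WWWW"])
# A decoy that ties with the native counts as a failure under this strict rule.
tied = native_is_recognized(toy_score, toy_structure, "ACDE", ["ACDE"])
```

Treating ties as failures is a design choice of this sketch; the text does not say how ties are handled.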

For comparison, we also test the discrimination results of the optimal linear scoring function taken as
reported in reference (Tobi et al., 2000), as well as the statistical potential developed by Miyazawa and
Jernigan. Here we use the contact definition reported in (Tobi et al., 2000), that is, two residues are
declared to be in contact if the geometric centers of their side chains are within a distance of 2.0 - 6.4 Å.

The nonlinear design scoring function capable of discriminating all of the 440 native sequences also
works well for the test set (Table 2). It succeeded in correctly identifying 93.3% (181 out of 194) of native
sequences in the independent test set of 194 proteins. This compares favorably with results obtained
using the optimal linear folding scoring function taken as reported in (Tobi et al., 2000), which succeeded in
identifying 80.9% (157 out of 194) of this test set. It also performs better than the optimal linear
scoring function based on calculations using parameters reported in reference (Bastolla et al., 2001),
which succeeded in identifying 73.7% (143 out of 194) of proteins in the test set. The Miyazawa-Jernigan
statistical potential succeeded in identifying 113 native proteins out of 194 (success rate 58.2%).

Discriminating Dissimilar Proteins. As with any other discrimination problem, the success of
classification strongly depends on the training data. If the scoring function is challenged with a protein
drastically different from the proteins in the training set, it is possible that the classification will fail. To further
test how well the nonlinear scoring function performs when discriminating proteins that are dissimilar
to those contained in the training set, we take five proteins that are longer than any training protein
(whose lengths are between 46 and 688). These are obtained from the list of 1,261 polypeptide chains contained in
the updated Oct 15, 2002 release of the Whatif database. The first test is to discriminate the 5 proteins from
1,728 exhaustively generated design decoys using gapless threading. The second test is to discriminate
these 5 proteins from exhaustively enumerated sequence decoys generated by threading 14 large protein
sequences of unknown structure obtained from the SwissProt database, whose sizes are between 1,124 and
2,459. This is necessary since structures of the longest chains otherwise have few or no threading decoys.
Table 3 lists the results of these tests, including the predicted score value and the smallest gap between the
native protein and decoys. For the first test, the nonlinear design scoring function can discriminate

Figure 3: The distribution of maximum distances of proteins to the set of training proteins. (a) The maximum
distance for each training protein to all other 439 proteins. (b) The maximum distance for each protein in the
201 test set to all 440 training proteins. These two distributions are similar.



these 5 native proteins from all decoys. For the second test, the design scoring function
can also discriminate all 5 proteins from a total of 53,565 SwissProt sequence decoys, and the smallest
score gaps between native proteins and decoys are large.

We found that it is infrequent for an unknown test protein to have low similarity to all reference
proteins. For each protein in the 440 training set, we calculate its Euclidean distance to the other 439
proteins. The distribution of the 440 maximum distances of each training protein to all other 439
proteins is shown in Figure 3a. We also calculate for each protein in the 201 test set its maximum
distance to all training proteins (Figure 3b). It is clear that for most of the 201 test proteins, the values
of maximum distances to training proteins are similar to the values for training set proteins. The only
exceptions are two proteins, ribonuclease inhibitor (1a4y.a) and formaldehyde ferredoxin oxidoreductase
(1b25.a). Although they are correctly classified, the former has a significant amount of unaccounted
interchain contact with another protein, angiogenin, and the latter has iron/sulfur clusters. It seems that



Table 3: Discrimination of five large proteins against (a) design decoys and (b) folding decoys generated by
gapless threading, and against (c) additional design decoys generated by threading unrelated long proteins
(length from 1,124 to 2,459) to the structures of these five proteins. Here pdb is the PDB code of the protein
structure, N is the size of the protein, n is the number of decoys, H is the predicted value of the scoring function,
and $\Delta_{score}$ is the smallest score gap between the native protein and its decoys. The results show that all decoys
can be discriminated from natives, and the smallest score gaps between native and decoys are large.

| pdb | N | n | (a) Design Decoy by KDF: H | (a) $\Delta_{score}$ | (b) Folding Decoy by KFF: H | (b) $\Delta_{score}$ | (c) SwissProt Decoy by KDF: n | (c) H | (c) $\Delta_{score}$ |
|---|---|---|---|---|---|---|---|---|---|
| 1cs0.a | 1073 | | 2.67 | N/A | 2.31 | N/A | 8 232 | 2.67 | 2.42 |
| 1g8k.a | 822 | 545 | 2.07 | 4.18 | 1.49 | 4.71 | 11 997 | 2.07 | 1.69 |
| 1gqi.a | 708 | 1002 | 3.03 | 5.16 | 2.82 | 5.03 | 13 707 | 3.03 | 2.16 |
| 1kqf.a | 981 | 93 | 2.19 | 5.17 | 1.85 | 4.95 | 9 612 | 2.19 | 1.82 |
| 1lsh.a | 954 | 148 | 1.97 | 4.57 | 1.66 | 4.02 | 10 017 | 1.97 | 2.01 |

Table 4: The nearest neighbors of the 13 proteins misclassified by the design function. The numbers of native protein
support vectors among the top 3, 5, and 11 nearest neighbors (NNs) are listed. Except for protein 1bx7, the majority
of nearest neighbors of all misclassified proteins are decoys.

| pdb | 3-NN | 5-NN | 11-NN |
|---|---|---|---|
| 1bd8 | | | |
| 1bx7 | 2 | 3 | 5 |
| 1bxy.a | | 1 | 1 |
| 1cku.a | 1 | 2 | 2 |
| 1dpt.a | 2 | 2 | 2 |
| 1flt.v | 1 | 3 | 3 |
| 1hta | | | |
| 1mro.c | | | |
| 1ops | 1 | 1 | 1 |
| 1psr.a | 1 | 1 | 1 |
| 1rb9 | 1 | 1 | 1 |
| 1ubp.b | 1 | 1 | 1 |
| 3ezm.a | | | |


the set of training proteins provides an adequate basis set for characterizing the global fitness landscape
of sequence design for other proteins.
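
The maximum-distance comparison between training and test proteins can be sketched as follows. This is a minimal illustration with hypothetical two-dimensional vectors in place of real contact count vectors; `max_distance_to_set` is an illustrative name, and `math.dist` requires Python 3.8 or later.

```python
import math


def max_distance_to_set(vec, reference):
    """Largest Euclidean distance from vec to any vector in reference."""
    return max(math.dist(vec, ref) for ref in reference)


# Hypothetical toy vectors standing in for contact count vectors.
training = [[0.0, 0.0], [3.0, 4.0], [6.0, 8.0]]
test = [[0.0, 1.0]]

# For a training protein, compare against the other training proteins only.
train_max = [max_distance_to_set(v, training[:i] + training[i + 1:])
             for i, v in enumerate(training)]
# For a test protein, compare against all training proteins.
test_max = [max_distance_to_set(v, training) for v in test]
```

Comparing the two resulting distributions is what indicates whether a test protein lies unusually far from everything seen in training.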

Nature of Misclassification. We further distinguish misclassifications due to a native protein being
too close to a decoy from misclassifications due to decoys being too close to a native protein. Among the set
of 201 test proteins, the native sequences of 13 proteins are not recognized correctly from design decoys.
These 13 proteins are true misclassifications because they do not have extensive unaccounted interchain
interactions or cofactor interactions. We calculate the Euclidean distance of each of the 13 proteins
from the 220 native proteins and 1,685 decoys that participate in the kernel design scoring function. The
results are shown in Table 4, where the numbers of native proteins among the top 3, 5, and 11 nearest
neighboring vectors to the failed protein are listed. Except for protein 1bx7, all misclassifications are due to
native vectors being too close to decoys.
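
The nearest-neighbor count reported for each misclassified protein can be sketched as below. The function name, the one-dimensional toy vectors, and the choice of k values in the example are hypothetical; in the study the reference set would be the 220 native and 1,685 decoy vectors entering the scoring function.

```python
import math


def natives_among_nearest(query, support_vectors, ks=(3, 5, 11)):
    """Count native vectors among the k nearest support vectors of query.

    support_vectors: list of (vector, is_native) pairs
    Returns a dict mapping each k to the number of natives in the top k.
    """
    ranked = sorted(support_vectors, key=lambda sv: math.dist(query, sv[0]))
    return {k: sum(1 for _, is_native in ranked[:k] if is_native) for k in ks}


# Hypothetical one-dimensional toy data: (vector, is_native).
toy_support = [([1.0], False), ([2.0], True), ([3.0], False),
               ([4.0], True), ([5.0], False), ([-6.0], True)]
counts = natives_among_nearest([0.0], toy_support, ks=(3, 5))
```

A small count means the failed native sits in a neighborhood dominated by decoys, which is the situation the text identifies for most of the 13 proteins.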



5 Discussion 

Formulation of Non-linear Scoring Function. A basic requirement for computational studies of
protein design is an effective scoring function, which allows searching for and identifying sequences that adopt
the desired structural templates. Our study follows earlier works such as (Vendruscolo et al., 2000b;
Tobi and Elber, 2000; Goldstein et al., 1992), where empirical scoring functions based on a coarse
residue-level representation have been developed by optimization. The goal of this study is to explore ways to
improve the sensitivity and/or specificity of discrimination.

There are several routes towards improving empirical scoring functions. One approach is to introduce
higher order interactions, where three-body or four-body interactions are explicitly incorporated
in the scoring function (Zheng et al., 1997; Munson and Singh, 1997; Betancourt and Thirumalai, 1999;
Rossi et al., 2001; Li et al., 2003). A different approach is to introduce nonlinear terms. Recently, Fain
et al. used sums of Chebyshev polynomials up to order 6 for hydrophobic burial and each type of pairwise
interaction (Fain et al., 2002).

In this work, we propose a different framework for developing empirical protein scoring functions,
with the goal of simultaneous characterization of the fitness landscapes of many proteins. We use a set of
Gaussian kernel functions located at both native proteins and decoys as the basis set. The decoy set in
this formulation is equivalent to the reference state or null model used in statistical potentials. The
expansion coefficients $\{\alpha_N\}, N \in \mathcal{N}$ and $\{\alpha_D\}, D \in \mathcal{D}$ of the Gaussian kernels determine the specific

Table 5: The number of misclassified protein structures for the test set of 194 proteins and the set of 201 proteins using
the nonlinear kernel folding scoring function, two optimal linear scoring functions taken as reported in (Tobi et al., 2000) and in
Table I of (Bastolla et al., 2001), and the Miyazawa-Jernigan statistical potential (Miyazawa and Jernigan, 1996). The set of
201 proteins includes those with more than 30% interchain contacts and those with > 10% missing coordinates. We also
list the performance of the kernel design scoring function for structure recognition.

| Scoring function | Misclassified Natives (194 set) | Misclassified Natives (201 set) |
|---|---|---|
| Kernel Folding Scoring Function | 4/194 | 8/201 |
| Tobi & Elber | 7/194 | 13/201 |
| Bastolla et al | 2/194 | 5/201 |
| Miyazawa & Jernigan | 85/194 | 92/201 |
| Kernel Design Scoring Function | 4/194 | 9/201 |



form of the scoring function. Since native proteins and decoys are non-redundant and are represented as
unique vectors $c \in \mathbb{R}^d$, the Gram matrix of the kernel function is full-rank. Therefore, the kernel function
effectively maps the protein space into a high dimensional space in which effective discrimination with
a hyperplane is easier to obtain. The optimization criterion here is not the Z-score; rather, we search for
the hyperplane in the transformed high dimensional space with maximal separation distance between
the native protein vectors and the decoy vectors. This choice of optimality criterion is firmly rooted in
a large body of studies in statistical learning theory, where the expected number of errors in classification
of unseen future test data is minimized probabilistically by balancing the minimization of the training
error (or empirical risk) and the control of the capacity of specific types of functional form of the scoring
function (Vapnik, 1995; Burges, 1998; Scholkopf and Smola, 2002).

This approach is general and flexible, and can accommodate other protein representations, as long
as the final descriptor of protein and decoy is a $d$-dimensional vector. In addition, different forms of
nonlinear functions can be designed using different kernel functions, such as polynomial kernels and
sigmoidal kernels. It is also possible to adopt different optimality criteria, for example, by minimizing
the margin distance expressed in the 1-norm instead of the standard 2-norm Euclidean distance.
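
The interchangeability of kernels can be illustrated with a generic kernel expansion. This is a sketch under stated assumptions: the polynomial and sigmoid forms and their parameters (`degree`, `coef0`, `gamma`) are common textbook choices rather than values from this study, and `kernel_expansion` is a plain weighted sum that does not reproduce the sign convention or any offset of Equation (7).

```python
import math


def dot(x, y):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(x, y))


def gaussian_kernel(x, y, sigma2=416.7):
    """exp(-||x - y||^2 / (2 * sigma2))."""
    return math.exp(-sum((a - b) ** 2 for a, b in zip(x, y)) / (2.0 * sigma2))


def polynomial_kernel(x, y, degree=2, coef0=1.0):
    """(x . y + coef0)^degree."""
    return (dot(x, y) + coef0) ** degree


def sigmoid_kernel(x, y, gamma=0.01, coef0=0.0):
    """tanh(gamma * x . y + coef0)."""
    return math.tanh(gamma * dot(x, y) + coef0)


def kernel_expansion(c, centers, coeffs, kernel):
    """Weighted sum of kernel values between c and each center vector."""
    return sum(a * kernel(c, center) for a, center in zip(coeffs, centers))


# Hypothetical toy centers (one "native", one "decoy") and coefficients.
centers = [[1.0, 2.0], [3.0, 4.0]]
coeffs = [0.5, -0.5]
query = [1.0, 2.0]
values = {
    "gaussian": kernel_expansion(query, centers, coeffs, gaussian_kernel),
    "polynomial": kernel_expansion(query, centers, coeffs, polynomial_kernel),
    "sigmoid": kernel_expansion(query, centers, coeffs, sigmoid_kernel),
}
```

Only the `kernel` argument changes between the three evaluations, which is the sense in which the framework accommodates other kernel functions.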

Folding Scoring Function. The geometric views of the design scoring function and the optimality
criterion also apply to the protein folding problem. For the folding scoring function, the only difference from
the design scoring function of Equation (7) is that here $\mathcal{D}$ is a set of structure decoys rather than a set of
sequence decoys. Specifically, we generate for each native protein $(s_N, a_N)$ a set of structure decoys
$\{(s_D, a_N)\}$, i.e., by mounting the native sequence on a fragment of the structure of a large protein such
that the fragment contains exactly the same number of amino acid residues as the native protein. We use the same
training set of 440 proteins from Whatif98 and 14,080,766 structure decoys as in the design study. The
same optimization technique of margin maximization is used. The $\sigma^2$ value and the numbers of proteins
and decoys entering the final folding scoring function are listed in Table 1.

For comparison, we also report discrimination results of the optimal linear scoring function taken as
reported in (Tobi et al., 2000), as well as the statistical potential developed by Miyazawa and Jernigan.
Here we use the contact definition reported in (Tobi et al., 2000), that is, two residues are declared to
be in contact if the geometric centers of their side chains are within a distance of 2.0 - 6.4 Å.

To test nonlinear folding scoring functions for the same 194 and 201 test set proteins, we take the
structure $s$ from the conformation-sequence pair $(s, a_N)$ with the lowest score as the predicted structure
of the native sequence. If it is not the native structure $s_N$, the discrimination has failed and the folding scoring
function does not work for this protein. The results of discrimination are summarized in Table 5. There
are 4 and 8 misclassified native structures for the 194 set and 201 set, respectively. These correspond
to failure rates of 2.1% and 4.0%, respectively. The performance of the optimal nonlinear kernel
folding scoring function is better than that of the optimal linear scoring function of (Tobi et al., 2000), based on
calculation using values taken from (Tobi et al., 2000) (failure rates 3.6% and 6.5% for the 194 set and
201 set, respectively), and is comparable to the results using values taken from reference (Bastolla et al.,
2001) (2 and 5 misclassifications, failure rates of 1.0% and 2.5% for the 194 set and 201 set, respectively).
Consistent with previous reports (Clementi et al., 1998), the statistical potential has about 43.8% (81 out of
194) and 43.2% (87 out of 201) failure rates for the 194 set and the 201 set, respectively.

An updated study to reference (Vendruscolo et al., 2000b) reported perfect discrimination for 1,000
proteins from folding decoys (Bastolla et al., 2001). Our results cannot be directly compared with this
study, because many of the test proteins or their homologs in our study are likely to be included in
the training set of (Bastolla et al., 2001), as it is the union of proteins in the Whatif database and
the Pdbselect database. In addition, it is not clear whether all decoys generated by gapless threading
were tested in reference (Bastolla et al., 2001). This makes a direct comparison of the two studies rather
difficult.

It is informative to examine the four proteins misclassified by the kernel folding scoring function
(1bx7, 1hta, 1ops, and 3ezm.a). Hirustasin 1bx7 contains five disulfide bonds, which are not modeled
explicitly by the protein description. 1hta (histone Hmfa) exists as a tetramer in complex with DNA
under physiological conditions. Its native structure may not be the same as that of a lone chain. The two
termini of this protein are rather flexible, and their conformations are not easy to determine. Among
the 13 native sequences misclassified by the kernel design scoring function (1bd8, 1bx7, 1bxy.a, 1cku.a,
1dpt.a, 1flt.v, 1hta, 1mro.c, 1ops, 1psr.a, 1rb9, 1ubp.b, 3ezm.a), several have extensive interchain
interactions, although the Contact Ratio is below the rather arbitrary threshold of 30%: Contact Ratio of
24% for 1mro.c, 19% for 1ubp.b, 24% for 1flt.v, 15% for 1psr.a, and 13% for 1qav.a. It is likely that the
substantial contacts with other chains would alter the conformation of a protein. 1cku.a (electron transfer
protein) contains an iron/sulfur cluster, which covalently binds to four Cys residues and prevents them
from forming 2 disulfide bonds. These covalent bonds are not modeled explicitly. 1bvf (oxidoreductase)
is complexed with a heme and an FMN group. The conformations of 1cku.a may be different upon
removal of these functionally important hetero groups. Altogether, there is some rationalization for 8
of the 13 misclassified proteins.

In many cases, the misclassification of some native conformations is indicative of the peculiar
nature of the protein structures. This is true for both the linear scoring functions reported in (Vendruscolo and Domany,
1998; Vendruscolo et al., 2000b) and the nonlinear kernel function developed in this study. For example,
the misclassified proteins are often peptide chains stabilized by other chains, or by interactions with
cofactors, or are small fragments whose interactions are modified by crystal lattice interactions, or are
NMR structures which are less compact and less stable than X-ray structures. Although in this study
we attempted to alleviate such complications by eliminating very short peptide fragments and excluding
proteins with over 30% interchain contacts, it is unlikely that all problematic protein structures can be
completely eliminated from the training set. As shown by Bastolla et al. (Bastolla et al., 2001), the design
of optimized scoring functions is likely to be susceptible to the presence of wrong samples when a large training
set is used.

For protein folding scoring functions derived from simple decoys generated by gapless threading, a
more challenging test is to discriminate native proteins from an ensemble of explicitly generated
three-dimensional decoy structures with a significant number of near-native conformations (Park and Levitt,
1996; Samudrala and Moult, 1998). Here we evaluate the performance of nonlinear scoring functions using
three decoy sets from the database "Decoys 'R' Us" (Samudrala and Levitt, 2000): the 4state_reduced
set, the lattice_ssfit set, and the lmsd set. We compare the performance of our scoring functions with results
reported in the literature using the optimal linear scoring function (Tobi and Elber, 2000) and the statistical potential
(Miyazawa and Jernigan, 1996) (Table 6). For the 4state_reduced set of decoys, the nonlinear folding
scoring function has the best performance in terms of identifying the native structure, with only one
misclassification (2cro). The correlation between the root mean square distance (RMSD) of conformations to the
native structure and the score value in the 4state_reduced set is shown in Figure 4. Although the performance in
discriminating explicitly generated challenging decoys is not as good as that in discriminating decoys
generated by threading, it is likely that nonlinear kernel scoring functions can be further improved if more
realistic structural decoys are included in training. The generation of realistic structural decoys is more
involved. Several methods have been developed for generating realistic decoys, including the original
"build-up" method (Park and Levitt, 1996), those with additional energy minimization (Loose et al.,
2004), and methods based on fragment assembly (Simons et al., 1997). In addition, an effective strategy
of sequential importance sampling has also been proposed to generate protein-like long chain compact
self-avoiding walks and to overcome the attrition problem (Zhang et al., 2003). This approach has been
applied to generate realistic decoys. Preliminary results of deriving scoring functions using such decoys
can be found in (Zhang et al., 2004).
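
The rank and z-score used to summarize performance on explicit decoy sets can be sketched as follows. The conventions here are assumptions, since the text does not spell them out: lower scores are taken as more native-like, the rank counts decoys scoring strictly below the native, and the z-score is the decoy mean minus the native score, divided by the population standard deviation of the decoy scores. The function name and toy scores are hypothetical.

```python
import statistics


def rank_and_zscore(native_score, decoy_scores):
    """Return (rank of the native, z-score of the native) within a decoy set."""
    # Rank 1 means no decoy scores strictly lower than the native.
    rank = 1 + sum(1 for s in decoy_scores if s < native_score)
    mean = statistics.mean(decoy_scores)
    spread = statistics.pstdev(decoy_scores)
    z = (mean - native_score) / spread
    return rank, z


# Hypothetical toy scores.
toy_rank, toy_z = rank_and_zscore(-3.0, [-1.0, 0.0, 1.0, -4.0])
```

Under these conventions a rank of 1 with a large positive z-score indicates a well separated native structure.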

Figure 4: Correlation of scores of decoys evaluated by nonlinear kernel folding scoring function and their RMSD
values to the native proteins in the 4state_reduced set.

Nonlinear Scoring Function for Folding and Design. Sequence decoys and structure decoys 
in general lead to different scoring functions. For example, the contact count vectors c can be very 
different for a sequence decoy of a protein and a structure decoy of the same protein. The discrimination 
surface defined by the design scoring function and the folding scoring function therefore may be different. 
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Table 6: Results of discrimination of native structures from decoys using nonli near kernel scoring functions . The 
decoy sets include 4state_reduced set, Lattice_ssfit set, and lmsd set l|Samudrala and LevTttt l2000). The 
rank of the native structure and its z-score are listed. The correlation coefficient R is also listed in parenthesis 
for the 4state_REDUCED set. KFF stands for kernel folding scoring function, and KDF stands for kernel design 
scoring function. TE -13 scoring funct i on is linear distance based scoring function optimized by linear programming , 
taken as reported in (|Tobi and I Elberl EoOOV BFKV the linear scoring function reporte d in (|Bastolla ef a/1 [20011. 
and MJ is the statistical scoring function as reported in ( Mivazawa and Jernigan, 1996). Results for TE-13 scoring 
function and Miyazawa- Jernigan scoring function are taken from Table II of ([Tobi and El berl 12000V 



1. 4state_reduced

Protein   # of decoys  KFF            KDF             MJ          TE-13     BFKV
1ctf      631          1/3.64(0.49)   1/3.14(0.55)    1/3.73      1/4.20    2/3.00
1r69      676          1/3.77(0.45)   1/3.79(0.55)    1/4.11      1/4.06    1/4.30
1sn3      661          1/2.15(0.24)   27/1.79(0.41)   2/3.17      6/2.70    1/2.89
2cro      675          3/2.57(0.54)   1/2.66(0.61)    1/4.29      1/3.48    2/2.91
3icb      654          1/2.56(0.70)   1/2.68(0.74)    -           -         1/2.96
4pti      688          1/4.17(0.41)   1/2.79(0.54)    3/3.16      7/2.43    1/3.49
4rxn      678          1/3.45(0.47)   7/1.99(0.53)    1/3.09      16/1.97   1/3.32

2. lattice_ssfit

Protein   # of decoys  KFF            KDF             MJ          TE-13     BFKV
1beo      2001         15/2.45        1/3.94          -           -         1/3.70
1ctf      2001         1/3.76         1/5.35          1/5.35      1/6.17    1/4.66
1dkt      2001         17/2.42        8/2.64          32/2.41     2/3.92    4/3.38
1fca      2001         56/2.00        98/1.76         5/3.40      36/2.25   14/2.56
1nkl      2001         1/3.60         1/3.51          1/5.09      1/4.51    1/4.53
1pgb      2001         1/3.95         1/4.91          3/3.78      1/4.13    1/3.41
1trl      2001         56/1.97        18/2.67         4/2.91      1/3.63    90/1.75
4icb      2001         1/3.92         1/5.31          -           -         1/4.39

3. lmsd

Protein   # of decoys  KFF            KDF             MJ          TE-13     BFKV
1b0n-B    498          406/-0.94      19/2.05         -           -         257/-0.03
1bba      501          500/-3.58      487/-1.83       -           -         500/-3.31
1ctf      498          1/3.62         1/3.31          1/3.86      1/4.13    1/2.92
1dtk      216          59/0.64        185/-1.11       13/1.71     5/1.88    54/0.74
1fc2      501          501/-3.08      486/-1.87       501/-6.24   14/2.04   501/-3.84
1igd      501          1/5.18         1/3.93          1/3.25      2/3.11    6/2.68
1shf-A    438          5/2.14         12/1.82         11/2.01     1/4.13    1/3.28
2cro      501          2/2.65         1/3.24          1/5.07      1/3.96    1/4.59
2ovo      348          1/3.11         38/1.21         2/3.25      1/3.62    40/1.15
4pti      344          1/3.14         108/0.62        -           -         10/1.86

Entries are rank/z-score; (-) marks a cell with no entry.
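The rank/z-score entries reported for each decoy set can be illustrated with a short sketch. It assumes that a lower score is more native-like (so that rank 1 means no decoy scores below the native) and that the z-score is the gap between the mean decoy score and the native score in units of the decoy standard deviation. These conventions, the function name `rank_and_zscore`, and the toy scores are assumptions for illustration and are not taken from the paper.

```python
import statistics


def rank_and_zscore(native_score, decoy_scores):
    """Return (rank, z) of the native among decoys.

    Assumed conventions (not stated explicitly in the text):
      * lower score = more native-like, so rank 1 means no decoy scores lower;
      * z = (mean(decoys) - native) / std(decoys), positive when the native
        scores better than the average decoy.
    """
    rank = 1 + sum(1 for s in decoy_scores if s < native_score)
    mean = statistics.fmean(decoy_scores)
    std = statistics.pstdev(decoy_scores)  # population standard deviation
    return rank, (mean - native_score) / std


# Toy scores, purely hypothetical.
toy_decoys = [1.0, 2.0, 3.0, 4.0, 5.0]
good_rank, good_z = rank_and_zscore(-2.0, toy_decoys)
poor_rank, poor_z = rank_and_zscore(2.5, toy_decoys)
print(f"{good_rank}/{good_z:.2f}", f"{poor_rank}/{poor_z:.2f}")
```

Under these conventions a negative z-score, as seen for some lmsd entries, simply means the native scores worse than the average decoy.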



There are 220 out of 440 native proteins participating in the design scoring function, and 214 out of 440
native proteins participating in the folding scoring function. There are 199 proteins that appear in both
the folding and design scoring functions. The majority of these native proteins have similar $\alpha$ values in
the two scoring functions. Fig. 5 shows the difference $\Delta\alpha_i$ of the coefficient $\alpha_i$ for each protein $i$
appearing in both the folding scoring function and the design scoring function. In most cases, the $\Delta\alpha_i$
values are small. That is, most native proteins contribute similarly to the design scoring function and to
the folding scoring function. This is expected, because the main differences between the two scoring
functions are due to differences in decoys. Out of the top 20 proteins with the largest $|\alpha_i|$ values, 11 are
common to both folding and design scoring functions. It is possible that the score values given by the kernel
folding scoring function and by the kernel design scoring function are similar for many structure-sequence
pairs $(s, a)$. Figure 6a shows that the 194 proteins in the test set have similar score values under the kernel
folding and kernel design scoring functions.

Figure 5: The difference in contribution to the scoring function for the 199 native protein structures that
participate in both folding and design scoring functions. They are sorted by
$\Delta\alpha = \alpha_{\mathrm{design}} - \alpha_{\mathrm{folding}}$. The majority of them have $\Delta\alpha$ close to 0.
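The comparison of coefficients between the two scoring functions amounts to simple bookkeeping, sketched below. The dictionaries mapping protein identifiers to coefficients, the function name `compare_coefficients`, and all numbers are hypothetical; only the procedure (difference over shared proteins, overlap of the largest-magnitude entries) follows the description in the text.

```python
def compare_coefficients(alpha_design, alpha_folding, top=20):
    """Compare coefficients of proteins that enter both scoring functions.

    Returns the per-protein difference (design minus folding) and the set of
    proteins that are among the `top` largest |alpha| in both functions.
    """
    shared = sorted(set(alpha_design) & set(alpha_folding))
    delta = {p: alpha_design[p] - alpha_folding[p] for p in shared}

    def top_by_magnitude(alpha):
        ranked = sorted(shared, key=lambda p: abs(alpha[p]), reverse=True)
        return set(ranked[:top])

    common_top = top_by_magnitude(alpha_design) & top_by_magnitude(alpha_folding)
    return delta, common_top


# Hypothetical coefficients; "p4" and "p5" each enter only one function.
alpha_design = {"p1": 5.0, "p2": 1.0, "p3": 0.5, "p4": 2.0}
alpha_folding = {"p1": 4.0, "p2": 6.0, "p3": 0.25, "p5": 9.0}

delta, common_top = compare_coefficients(alpha_design, alpha_folding, top=2)
_, common_top1 = compare_coefficients(alpha_design, alpha_folding, top=1)
print(delta, sorted(common_top), sorted(common_top1))
```

With the real coefficients, `top=20` would correspond to the overlap count quoted in the text.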

We also compare the values of the scoring functions for each of the 210 unit vectors
$c_1 = \{1, 0, \ldots, 0\}^T, \ldots, c_{210} = \{0, \ldots, 0, 1\}^T$. We normalize these values so that
$\max_i H(c_i) = 1$ for both scoring functions (Fig. 6b). There is a strong correlation ($R = 0.94$) between the
folding and design scoring functions.
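A minimal sketch of this unit-vector comparison follows. It assumes a scoring function of the form $H(c) = \sum_D \alpha_D K(c, c_D) - \sum_N \alpha_N K(c, c_N) + b$ with a Gaussian kernel $K(x, y) = \exp(-\|x - y\|^2 / 2\sigma^2)$; this form, the helper names, and the toy 4-dimensional support vectors are assumptions for illustration (the paper works with 210-dimensional contact vectors and its own fitted coefficients).

```python
import math


def gaussian_kernel(x, y, sigma2=1.0):
    return math.exp(-sum((a - b) ** 2 for a, b in zip(x, y)) / (2.0 * sigma2))


def kernel_score(c, natives, decoys, sigma2=1.0, bias=0.0):
    """Assumed form: decoy terms raise the score, native terms lower it."""
    up = sum(a * gaussian_kernel(c, v, sigma2) for a, v in decoys)
    down = sum(a * gaussian_kernel(c, v, sigma2) for a, v in natives)
    return up - down + bias


def normalized_profile(score, dim):
    """Evaluate `score` on the dim unit vectors and scale so the maximum is 1."""
    values = [score([1.0 if i == k else 0.0 for i in range(dim)]) for k in range(dim)]
    peak = max(values)
    if peak <= 0.0:
        raise ValueError("normalization by the maximum needs a positive peak")
    return [v / peak for v in values]


def pearson_r(x, y):
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(vx * vy)


# Toy (alpha, vector) pairs in 4 dimensions, purely hypothetical.
design = lambda c: kernel_score(
    c, natives=[(1.0, [2.0, 0.0, 1.0, 0.0])], decoys=[(1.0, [0.0, 2.0, 0.0, 1.0])])
folding = lambda c: kernel_score(
    c, natives=[(0.8, [2.0, 0.0, 1.0, 0.0])],
    decoys=[(1.2, [0.0, 2.0, 0.0, 1.0]), (0.5, [0.0, 1.0, 1.0, 1.0])])

profile_design = normalized_profile(design, dim=4)
profile_folding = normalized_profile(folding, dim=4)
r = pearson_r(profile_design, profile_folding)
print(f"R = {r:.2f}")
```

Each unit vector probes the response of the function to a single contact type in isolation, which is why the two profiles can be compared coordinate by coordinate.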

However, other comparisons reveal that the kernel folding and design scoring functions are different. One
method is to compare the scores of a subset of decoys that are challenging. That is, we compare the
evaluated scores of decoys with $\alpha_j \neq 0$. Fig. 6c shows that for decoys appearing in the design scoring
function, there is little correlation between the scores calculated by the design scoring function and by the
folding scoring function. Similarly, there is no strong correlation between the scores calculated by the folding
scoring function and by the design scoring function for the set of structure decoys entering the folding
scoring function (Fig. 6d). It seems that although the $\alpha$ values are similar for the majority of the native
proteins, the design scoring function and the folding scoring function can give very different score values for
some conformations. This suggests that the overall fitness landscapes for design and for folding may be
different. However, since all empirical scoring functions derived from optimization and protein structures
depend on the choice of training set proteins and decoys, we cannot rule out the alternative explanation that
the observed difference between design and folding scoring functions is due to the difference between the
decoy sets.

Remarks. Our goal in this study is to explore an alternative formulation of the scoring function and to assess
the effectiveness of this new approach with experimental data. The nonlinear scoring functions obtained
in this study should be further improved. For example, unlike the study of optimal linear scoring functions
(Tobi et al., 2000), where explicitly generated three-dimensional decoy structures are used in training,
we used only structure decoys generated by threading. The test results using the 4state_reduced set
and the lattice_ssfit set are comparable to or better than those of other residue-based scoring functions
(see Fig |1] and Table 6). It is likely that further incorporation of explicit three-dimensional decoy
structures in the training set would improve the protein scoring function.
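To make the distinction between threading-generated sequence decoys and structure decoys concrete, the following sketch builds 210-dimensional contact count vectors (one entry per unordered pair of the 20 residue types) and generates both kinds of decoys by gapless threading. The contact lists, helper names, and toy sequences are hypothetical, and the way contacts are defined is left open; this is an illustration of the idea, not the authors' implementation.

```python
from itertools import combinations_with_replacement

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # alphabetical one-letter codes
# 20 * 21 / 2 = 210 unordered residue-type pairs, each mapped to a vector index.
PAIR_INDEX = {pair: k for k, pair in
              enumerate(combinations_with_replacement(AMINO_ACIDS, 2))}


def contact_vector(sequence, contacts):
    """Count contacts per residue-type pair for a sequence mounted on a structure.

    `contacts` lists (i, j) residue positions that are in contact in the structure.
    """
    c = [0] * len(PAIR_INDEX)
    for i, j in contacts:
        pair = tuple(sorted((sequence[i], sequence[j])))
        c[PAIR_INDEX[pair]] += 1
    return c


def sequence_decoys(native_contacts, length, other_sequence):
    """Thread fragments of another protein's sequence onto the native structure."""
    for start in range(len(other_sequence) - length + 1):
        yield contact_vector(other_sequence[start:start + length], native_contacts)


def structure_decoys(native_sequence, other_contacts, other_length):
    """Thread the native sequence onto windows of another protein's contact map."""
    n = len(native_sequence)
    for start in range(other_length - n + 1):
        window = [(i - start, j - start) for i, j in other_contacts
                  if start <= i < start + n and start <= j < start + n]
        yield contact_vector(native_sequence, window)


# Toy data, purely hypothetical.
native_seq, native_contacts = "ACDA", [(0, 2), (1, 3)]
c_native = contact_vector(native_seq, native_contacts)
seq_decoys = list(sequence_decoys(native_contacts, len(native_seq), "WYACDAG"))
str_decoys = list(structure_decoys(native_seq, [(0, 3), (1, 4), (2, 5)], 6))
print(len(PAIR_INDEX), len(seq_decoys), len(str_decoys))
```

Sequence decoys keep the native contact map and change the residue types, while structure decoys keep the residue types and change the contact map, so the two decoy families can populate quite different regions of the 210-dimensional space.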

The evaluation of the nonlinear scoring function requires more computation than that of a linear function,
but the time required is modest: on an AMD Athlon MP 1800+ machine with a 1.54 GHz clock speed and 2 GB
of memory, we can evaluate the scoring function for 8,130 decoys per minute.
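The quoted throughput can be turned into per-decoy figures with a few lines of arithmetic. The sketch below uses only numbers stated in the text (8,130 decoys per minute, 14 million training decoys, and 1,685 decoys plus 220 native proteins in the design scoring function); attributing the throughput to a function with exactly that many kernel terms is an assumption made here for illustration.

```python
DECOYS_PER_MINUTE = 8130        # throughput quoted in the text
TRAINING_DECOYS = 14_000_000    # sequence decoys in the training set
KERNEL_TERMS = 1685 + 220       # decoys + natives entering the design function

ms_per_decoy = 60_000 / DECOYS_PER_MINUTE
# Assumption: the quoted throughput refers to a function with KERNEL_TERMS terms.
us_per_kernel_term = 1000 * ms_per_decoy / KERNEL_TERMS
hours_for_training_decoys = TRAINING_DECOYS / DECOYS_PER_MINUTE / 60

print(f"{ms_per_decoy:.2f} ms per decoy")
print(f"{us_per_kernel_term:.2f} microseconds per kernel term")
print(f"{hours_for_training_decoys:.1f} hours for all training decoys")
```

This works out to about 7.4 ms per decoy, so a single pass over all 14 million training decoys on such a machine would take roughly 29 hours.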

Overfitting can be a problem in discrimination. Overfitting occurs when the scoring function accurately
predicts the outcomes of the training set data but performs poorly when challenged with unrelated and
unseen test data. Although our scoring function involves a large number of basis set proteins and decoys,
it does not suffer from overfitting, because it performs well in blind tests of discriminating native
proteins from both structural and sequence decoys.

Figure 6: Comparison of kernel design scoring function (KDF) and nonlinear kernel folding scoring function (KFF).
(a) The score values by KFF and by KDF for the 194 proteins are strongly correlated. The correlation coefficient
is R = 0.90. (b) The score values of the nonlinear design and folding scoring functions for the 210 unit vectors
are strongly correlated (R = 0.94). (c) The score values by the design scoring function and by the folding scoring
function for decoys that enter the nonlinear design scoring function are poorly correlated. (d) The score values
for decoys that enter the nonlinear folding scoring function are also poorly correlated.

In pursuit of improved sensitivity and specificity in discrimination, the number of reference decoy and
native structures currently entering the scoring function is large (e.g., 1,685 decoys and 220 native proteins
for the design scoring function). However, we expect that the scoring function can be significantly simplified
and the number of basis proteins and decoys reduced considerably. The use of the 1-norm instead of the 2-norm
in the objective function of the optimization problem will automatically reduce the number of vectors
(Schölkopf and Smola, 2002). In addition, new techniques such as the finite Newton method for reduced support
vector machines have recently shown great promise in further reducing the number of support vectors, where a
reduction ratio of 1% has been reported (Lee and Mangasarian, 2001; Fung and Mangasarian, 2002).

Conclusion. We found in this study that no linear scoring function exists that can discriminate a
training set of 440 native sequences from 14 million sequence decoys generated by gapless threading. The
success of the nonlinear scoring function in perfectly discriminating this training set and its good
performance on an unrelated test set of 194 proteins are encouraging. This indicates that it is now possible to
characterize simultaneously the fitness landscape of many proteins, and that the nonlinear kernel scoring
function is a general strategy for developing effective scoring functions for protein sequence design.

Our study of scoring functions for sequence design is a much smaller task than developing a full-fledged
fitness function, because we study a restricted version of the protein design problem. We need to
recognize only one sequence that folds into a known structure from other sequences already known to be
part of a different protein structure, whose identity is hidden during training. However, this simplified
task is challenging, because the native sequences and decoy sequences in this case are all taken from real
proteins. Success in this task is a prerequisite for further development of a full-fledged universal scoring
function. A full solution to the sequence design problem will need to incorporate additional sequences of
structural homologs as native sequences, as well as additional decoy sequences that fold into different
folds, and decoy sequences that are not proteins (e.g., all hydrophobes). It is our hope that the functional
form and the optimization technique introduced here will also be useful for such purposes.

Table 7: The top twenty proteins with the largest α values among the 199 proteins entering both the kernel folding
scoring function and the kernel design scoring function. The α value, the protein class as defined by SCOP, and the
number of residues are also listed.

       Kernel Design Scoring Function                   Kernel Folding Scoring Function
Index  pdb   chain  α value  Class            Residues  pdb   chain  α value  Class            Residues
1      2por  -      130.88   Membrane / cell  301       2spc  a      25.95    α                107
2      1prn  -      96.73    Membrane / cell  289       2por  -      23.19    Membrane / cell  301
3      2spc  a      52.27    α                107       1prn  -      14.31    Membrane / cell  289
4      1nsy  a      51.41    α/β              271       1rop  a      13.28    α                56
5      3pch  m      45.22    β                236       2wrp  r      11.41    α                104
6      1bkj  a      40.37    α + β            239       1nsy  a      10.68    α/β              271
7      1xjo  -      36.02    α/β              276       1apy  a      10.12    α + β            161
8      1bdb  -      34.26    α/β              276       1tgs  i      9.83     Small            56
9      1ppr  m      31.70    α                312       3pch  m      9.66     β                236
10     1fiv  a      27.48    β                113       1dan  l      8.80     Small            132
11     1hcz  -      27.23    β                250       7ahl  a      8.78     Membrane / cell  293
12     1tta  a      27.16    -                127       2ilk  -      8.72     α                155
13     7ahl  a      26.69    Membrane / cell  293       1ppr  m      8.25     α                312
14     2rhe  -      26.24    β                114       1bkj  a      8.09     α + β            239
15     3pch  a      26.23    β                200       1cot  -      8.04     α                121
16     1snc  -      26.10    β                135       1wht  b      7.54     α/β              153
17     1wht  b      24.69    α/β              153       1vps  a      7.25     β                285
18     1cot  -      23.80    α                121       1vls  -      7.05     α                146
19     1bvl  -      23.58    β                159       1snc  -      6.48     β                135
20     2kau  b      22.45    β                101       1cmb  a      6.48     α                104

(-) marks a cell with no entry.

In summary, we show in this study an alternative formulation of the scoring function using a mixture of
Gaussian kernels. We demonstrate that this formulation can lead to effective design scoring functions that
characterize the fitness landscape of many proteins simultaneously and perform well in blind independent
tests. Our results suggest that this functional form, which differs from the simple weighted sum of contact
pairs, can be useful for studying protein design and protein folding. This approach can be generalized
to any other protein representation, e.g., with descriptors for explicit hydrogen bonds and higher order
interactions.

7 Appendix



Lemma 1 For a scoring function in the form of a weighted linear sum of interactions, a decoy always has
a score value higher than the native structure by at least an amount of $b > 0$, i.e.,

$$ w \cdot (c_D - c_N) > b \quad \text{for all } \{(c_D - c_N) \mid D \in \mathcal{D} \text{ and } N \in \mathcal{N}\} \qquad (9) $$

if and only if the origin is not contained within the convex hull of the set of points
$\{(c_D - c_N) \mid D \in \mathcal{D} \text{ and } N \in \mathcal{N}\}$.

Proof: Suppose that the origin $0$ is contained within the convex hull $A = \mathrm{conv}(\{c_D - c_N\})$ of
$\{c_D - c_N\}$ and Equation (9) holds. By the definition of convexity, any point inside or on the convex hull $A$
can be expressed as a convex combination of points on the convex hull. Specifically, we have:

$$ 0 = \sum_{(c_D - c_N) \in A} \lambda_{c_D - c_N} \cdot (c_D - c_N), \quad \text{and} \quad \sum \lambda_{c_D - c_N} = 1, \quad \lambda_{c_D - c_N} > 0. $$

That is, we have the following contradiction:

$$ 0 = w \cdot 0 = w \cdot \sum_{c_D - c_N} \lambda_{c_D - c_N} \cdot (c_D - c_N) = \sum_{c_D - c_N} \lambda_{c_D - c_N} \cdot w \cdot (c_D - c_N) > \sum_{c_D - c_N} \lambda_{c_D - c_N} \cdot b = b. $$

Because the convex hull can be defined as the intersection of half-spaces derived from the
inequalities, if a bounding hyperplane has a distance $b > 0$ to the origin, all points contained within the
convex hull will be on the other side of the hyperplane (Edelsbrunner, 1987). Therefore,
$w \cdot (c_D - c_N) > b$ will hold for all $\{(c_D - c_N)\}$. ∎
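The geometric condition of Lemma 1 can be probed numerically on small examples. The sketch below uses a perceptron update, which is a swapped-in technique and not the method used in the paper: if it returns a weight vector $w$ with $w \cdot x > 0$ for every difference vector $x = c_D - c_N$, the origin lies outside their convex hull, and $w$ can be rescaled to reach any required margin $b$. The function names and the toy two-dimensional difference vectors are hypothetical.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def find_separating_weights(differences, max_epochs=1000):
    """Perceptron search for w with w . x > 0 for every x in `differences`.

    Returns w when a full pass needs no update, or None if none is found within
    `max_epochs` (which always happens when the origin is inside the hull).
    """
    w = [0.0] * len(differences[0])
    for _ in range(max_epochs):
        updated = False
        for x in differences:
            if dot(w, x) <= 0.0:
                w = [wi + xi for wi, xi in zip(w, x)]
                updated = True
        if not updated:
            return w
    return None


def margin(w, differences):
    """Smallest value of w . x over the set; positive means strict separation."""
    return min(dot(w, x) for x in differences)


# Toy difference vectors (decoy minus native), purely hypothetical.
separable = [[1.0, 0.2], [0.8, -0.3], [1.5, 1.0]]
inseparable = [[1.0, 0.0], [-1.0, 0.0]]  # origin is the midpoint of the two points

w_found = find_separating_weights(separable)
w_missing = find_separating_weights(inseparable, max_epochs=100)
print(w_found, margin(w_found, separable), w_missing)
```

A `None` result within a finite number of epochs is only suggestive in general; an exact test of whether the origin lies in the hull would use linear programming.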



References 

Bastolla, U., Farwer, J., Knapp, E. and Vendruscolo, M. (2001) How to guarantee optimal stability for
most representative structures in the protein data bank, Proteins, 44, 79-96.

Ben-Naim, A. (1997) Statistical potentials extracted from protein structures: Are these meaningful 
potentials?, J. Chem. Phys., 107, 3698-3706. 

Betancourt, M. and Thirumalai, D. (1999) Pair potentials for protein folding: Choice of reference states 
and sensitivity of predicted native states to variations in the interaction schemes, Protein Sci., 8, 
361-369. 

Burges, C. J. C. (1998) A Tutorial on Support Vector Machines for Pattern Recognition, Knowledge 
Discovery and Data Mining, 2, URL /papers/Burges98.ps.gz

Chiu, T. and Goldstein, R. (1998) Optimizing energy potentials for success in protein tertiary structure 
prediction, Folding Des., 3, 223-228. 

Clementi, C., Maritan, A. and Banavar, J. (1998) Folding, design, and determination of interaction
potentials using off-lattice dynamics of model heteropolymers, Phys. Rev. Lett., 81, 3287-3290. 

Dahiyat, B. and Mayo, S. (1997) De Novo protein design: Fully automated sequence selection, Science, 
278, 82-87. 

DeGrado, W., Summa, C., Pavone, V., Nastri, F. and Lombardi, A. (1999) De novo design and structural
characterization of proteins and metalloproteins, Annu. Rev. Biochem., 68, 779-819. 

Desjarlais, J. and Handel, T. (1995) De novo design of the hydrophobic cores of proteins, Protein Sci., 
19, 244-255. 

Deutsch, J. and Kurosky, T. (1996) New algorithm for protein design, Phys. Rev. Lett., 76, 323-326. 

Dima, R., Banavar, J., Cieplak, M. and Maritan, A. (2000) Scoring functions in protein folding and 
design, Protein Sci, 9, 812-819. 



Drexler, K. (1981) Molecular engineering: an approach to the development of general capabilities for 
molecular manipulation, Proc Natl Acad Sci USA, 78, 5275-5278. 

Edelsbrunner, H. (1987) Algorithms in combinatorial geometry, Springer-Verlag, Berlin.

Edelsbrunner, H. (1995) The union of balls and its dual shape, Discrete Comput Geom, 13, 415-440. 

Emberly, E. G., Wingreen, N. S. and Tang, C. (2002) Designability of alpha-helical proteins, Proc Natl 
Acad Sci USA, 99, 11163-11168, URL http://www.pnas.org/cgi/content/abstract/99/17/11163 

Fain, B., Xia, Y. and Levitt, M. (2002) Design of an optimal Chebyshev-expanded discrimination function 
for globular proteins, Protein Sci., 11, 2010-2021. 

Friedrichs, M. and Wolynes, P. (1989) Toward protein tertiary structure recognition by means of asso- 
ciative memory hamiltonians, Science, 246, 371-373. 

Fung, G. and Mangasarian, O. L. (2002) Finite newton method for lagrangian support vector machine 
classification, Technical Report 02-01, Data Mining Institute, Computer Sciences Department, Uni- 
versity of Wisconsin, URL ftp://ftp.cs.wisc.edu/pub/dmi/tech-reports/02-01.ps 

Lazar, G. A., Desjarlais, J. and Handel, T. (1997) De novo design of the hydrophobic core of ubiquitin, Protein
Sci., 6, 1167-1178.

Goldstein, R., Luthey-Schulten, Z. and Wolynes, P. (1992) Protein tertiary structure recognition using 
optimized hamiltonians with local interactions, Proc. Natl. Acad. Sci. USA, 89, 9029-9033. 

Hao, M. and Scheraga, H. (1996) How optimization of potential functions affects protein folding, Proc 
Natl Acad Sci USA, 93(10), 4984-4989. 

Hao, M.-H. and Scheraga, H. (1999) Designing potential energy functions for protein folding, Curr 
Opinion Structural Biology, 9, 184-188. 

Jernigan, R. and Bahar, I. (1996) Structure-derived potentials and protein simulations, Curr. Opin. 
Struct. Biol., 6, 195-209. 

Joachims, T. (1999) Advances in Kernel Methods - Support Vector Learning, chapter Making large-scale 
SVM learning practical, MIT Press. 

Jones, D., Taylor, W. and Thornton, J. (1992) A new approach to protein fold recognition, Nature, 358, 
86-89. 

Karmarkar, N. (1984) A new polynomial-time algorithm for linear programming, Combinatorica, 4, 373- 
395. 

Khatun, J., Khare, S. D. and Dokholyan, N. V. (2004) Can contact potentials reliably predict stability 
of proteins?, J. Mol. Biol., 336, 1223-1238. 

Koehl, P. and Levitt, M. (1999a) De Novo protein design. I. In search of stability and specificity, J. Mol. 
Biol., 293, 1161-1181. 

Koehl, P. and Levitt, M. (1999b) De Novo protein design. II. Plasticity of protein sequence, J. Mol. Biol., 
293, 1183-1193. 

Koretke, K., Luthey-Schulten, Z. and Wolynes, P. (1996) Self-consistently optimized statistical mechan- 
ical energy functions for sequence structure alignment, Protein Sci, 5, 1043-1059. 

Koretke, K., Luthey-Schulten, Z. and Wolynes, P. (1998) Self-consistently optimized energy functions for 
protein structure prediction by molecular dynamics, Proc Natl Acad Sci USA, 95(6), 2932-2937. 

Kuhlman, B. and Baker, D. (2000) Native protein sequences are close to optimal for their structures,
Proc. Natl. Acad. Sci. USA, 97, 10383-10388. 



Kuhlman, B., Dantas, G., Ireton, G. C., Varani, G., Stoddard, B. L. and Baker, D. (2003) Design of a
novel globular protein fold with atomic-level accuracy, Science, 302, 1364-1368. 

Lee, Y.-H. and Mangasarian, O. L. (2001) RSVM: Reduced support vector machines, in Proceedings of
the First SIAM International Conference on Data Mining, Chicago, IL, CD-ROM.

Lemer, C., Rooman, M. and Wodak, S. (1995) Protein-structure prediction by threading methods -
evaluation of current techniques, Proteins, 23, 337-355. 

Li, H., Helling, R., Tang, C. and Wingreen, N. (1996) Emergence of preferred structures in a simple 
model of protein folding, Science, 273, 666-669. 

Li, X., Hu, C. and Liang, J. (2003) Simplicial edge representation of protein structures and alpha contact 
potential with confidence measure, Proteins, 53, 792-805. 

Li, X. and Liang, J. (2004) Cooperativity and anti-cooperativity of three-body interactions in proteins, 
J. Phys. Chem. B., In review. 

Liang, J., Edelsbrunner, H., Fu, P., Sudhakar, P. and Subramaniam, S. (1998) Analytical shape comput- 
ing of macromolecules I: Molecular area and volume through alpha-shape., Proteins, 33, 1-17. 

Loose, C., Klepeis, J. and Floudas, C. (2004) A new pairwise folding potential based on improved decoy
generation and side-chain packing., Proteins, 54, 303-314. 

Lu, H. and Skolnick, J. (2001) A distance-dependent atomic knowledge-based potential for improved 
protein structure selection, Proteins, 44, 223-232. 

Maiorov, V. and Crippen, G. (1992) Contact potential that recognizes the correct folding of globular
proteins, J. Mol. Biol., 227, 876-888. 

Meller, J., Wagner, M. and Elber, R. (2002) Maximum feasibility guideline in the design and analysis of 
protein folding potentials, J. Comput. Chem., 23, 111-118. 

Meszaros, C. (1996) Fast Cholesky factorization for interior point methods of linear programming, Comp. 
Math. Appl., 31, 49 - 51. 

Micheletti, C., Seno, F., Banavar, J. and Maritan, A. (2001) Learning effective amino acid interactions
through iterative stochastic techniques, Proteins, 42(3), 422-431. 

Mirny, L. and Shakhnovich, E. (1996) How to derive a protein folding potential? a new approach to an 
old problem, J. Mol. Biol., 264, 1164-1179. 

Miyazawa, S. and Jernigan, R. (1985) Estimation of effective interresidue contact energies from protein 
crystal structures: quasi-chemical approximation, Macromolecules, 18, 534-552. 

Miyazawa, S. and Jernigan, R. (1996) Residue-residue potentials with a favorable contact pair 
term and an unfavorable high packing density term, J. Mol. Biol., 256, 623-644, URL 
citeseer.nj.nec.com/388482.html

Munson, P. and Singh, R. (1997) Statistical significance of hierarchical multi-body potential based on
delaunay tessellation and their application in sequence-structure alignment, Protein Sci, 6, 1467-1481. 

Pabo, C. (1983) Designing proteins and peptides., Nature, 301, 200. 

Park, B. and Levitt, M. (1996) Energy functions that discriminate x-ray and near-native folds from 
well-constructed decoys, J. Mol. Biol., 258, 367-392. 

Rossi, A., Micheletti, C., Seno, F. and Maritan, A. (2001) A self-consistent knowledge-based approach
to protein design, Biophys J, 80(1), 480-490. 

Samudrala, R. and Levitt, M. (2000) Decoys 'R' Us: a database of incorrect conformations to improve
protein structure prediction., Protein Sci., 9, 1399-1401. 



Samudrala, R. and Moult, J. (1998) An all-atom distance-dependent conditional probability discrimina- 
tory function for protein structure prediction, J. Mol. Biol., 275, 895-916. 

Schölkopf, B. and Smola, A. (2002) Learning with kernels: Support vector machines, regularization,
optimization, and beyond, The MIT Press, Cambridge, MA. 

Shakhnovich, E. (1998) Protein design : a perspective from simple tractable models, Folding & Design, 
3, R45-R58. 

Shakhnovich, E. and Gutin, A. (1993) Engineering of stable and fast-folding sequences of model proteins, 
Proc. Natl. Acad. Sci. USA., 90, 7195-7199. 

Simons, K. T., Kooperberg, C., Huang, E. and Baker, D. (1997) Assembly of protein tertiary structures
from fragments with similar local sequences using simulated annealing and bayesian scoring functions, 
J. Mol. Biol., 268, 209-225. 

Simons, K. T., Ruczinski, I., Kooperberg, C., Fox, B., Bystroff, C. and Baker, D. (1999) Improved
recognition of native-like protein structures using a combination of sequence-dependent and sequence- 
independent features of proteins, Proteins, 34, 82-95. 

Sippl, M. (1995) Knowledge-based potentials for proteins, Curr. Opin. Struct. Biol., 5(2), 229-235. 

Slovic, A. M., Kono, H., Lear, J. D., Saven, J. G. and DeGrado, W. F. (2004) From the Cover: Compu- 
tational design of water-soluble analogues of the potassium channel KcsA, Proc Natl Acad Sci USA, 
101, 1828-1833, URL http://www.pnas.org/cgi/content/abstract/101/7/1828

Tanaka, S. and Scheraga, H. (1976) Medium- and long-range interaction parameters between amino acids 
for predicting three-dimensional structures of proteins, Macromolecules, 9, 945-950. 

Thomas, P. and Dill, K. (1996a) An iterative method for extracting energy-like quantities from protein 
structures, Proc Natl Acad Sci USA, 93, 11628-11633. 

Thomas, P. and Dill, K. (1996b) Statistical potentials extracted from protein structures: How accurate 
are they?, J. Mol. Biol., 257, 457-469. 

Tobi, D. and Elber, R. (2000) Distance-dependent, pair potential for protein folding: Results from linear 
optimization, Proteins, 41, 40-46. 

Tobi, D., Shafran, C., Linial, N. and Elber, R. (2000) On the design and analysis of protein folding
potentials, Proteins, 40, 71-85. 

Vapnik, V. (1995) The Nature of Statistical Learning Theory, Springer, N.Y. 

Vapnik, V. and Chervonenkis, A. (1964) A note on one class of perceptrons, Automation and Remote 
Control, 25. 

Vapnik, V. and Chervonenkis, A. (1974) Theory of Pattern Recognition [in Russian], Nauka, Moscow,
(German Translation: W. Wapnik & A. Tscherwonenkis, Theorie der Zeichenerkennung, Akademie- 
Verlag, Berlin, 1979). 

Vendruscolo, M. and Domany, E. (1998) Pairwise contact potentials are unsuitable for protein folding, 
J. Chem. Phys., 109, 11101-11108. 

Vendruscolo, M., Najmanovich, R. and Domany, E. (2000a) Can a pairwise contact potential stabilize 
native protein folds against decoys obtained by threading?, Proteins, 38, 134-148. 

Vendruscolo, M., Najmanovich, R. and Domany, E. (2000b) Can a pairwise contact potential stabilize na- 
tive protein folds against decoys obtained by threading?, Proteins: Structure, Function, and Genetics, 
38, 134-148. 

Vriend, G. and Sander, C. (1993) Quality control of protein models - directional atomic contact analysis, 
J. Appl. Cryst., 26, 47-60. 



Wernisch, L., Hery, S. and Wodak, S. (2000) Automatic protein design with all atom force-fields by
exact and heuristic optimization, J. Mol. Biol., 301, 713-736. 

Wodak, S. and Rooman, M. (1993) Generating and testing protein folds, Curr. Opin. Struct. Biol., 3, 
247-259. 

Yue, K. and Dill, K. (1992) Inverse protein folding problem: Designing polymer sequences, Proc. Natl. 
Acad. Sci. USA., 89, 4163-4167. 

Zhang, J., Chen, R. and Liang, J. (2004) Potential function of simplified protein models for discriminating 
native proteins from decoys: Combining contact interaction and local sequence-dependent geometry, 
in 26th Annual International Conference, IEEE Engineering in Medicine and Biology Society, 2004,
IEEE, www.arxiv.org. 

Zhang, J., Chen, R., Tang, C. and Liang, J. (2003) Origin of scaling behavior of protein packing density: 
A sequential monte carlo study of compact long chain polymers, J. Chem. Phys., 118, 6102-6109. 

Zheng, W., Cho, S., Vaisman, I. and Tropsha, A. (1997) A new approach to protein fold recognition 
based on Delaunay tessellation of protein structure, in Altman, R., Dunker, A., Hunter, L. and Klein, 
T. (eds.), Pacific Symposium on Biocomputing'97, pp. 486-497, World Scientific, Singapore.






